Independent random effects in generalized linear models induce an exchangeable
correlation structure, but long sequences of counts or binomial observations
typically show correlations decaying with increasing lag. This dissertation
introduces models with autocorrelated random effects for a more appropriate,
parameter-driven analysis of discrete-valued time series data. We present a Monte
Carlo EM algorithm with Gibbs sampling to jointly obtain maximum likelihood
estimates of regression parameters and variance components. Marginal mean,
variance and correlation properties of the conditionally specified models are
derived for Poisson, negative binomial and binary/binomial random components.
They are used for constructing goodness-of-fit tables and checking the
appropriateness of the modeled correlation structure. Our models define a
likelihood, and hence estimation of the joint probability of two or more events
is possible and is used in predicting future responses. Also, all methods are
flexible enough to allow for multiple gaps or missing observations in the
observed time series. The approach is illustrated with the analysis of a
cross-sectional study over 30 years, where only observations from 16 unequally
spaced years are available, a time series of 168 monthly counts of polio
infections, and two long binary time series.

CHAPTER 1
INTRODUCTION 

Correlated discrete data arise in a variety of settings in the biomedical,
social, political or business sciences whenever a discrete response variable is
measured repeatedly. Examples are time series of counts or longitudinal studies
measuring a binary response. Correlations between successive observations arise
naturally through a time, space or some other cluster-forming context and have
to be incorporated in any inferential procedure. Standard regression models
for independent data can be expanded to accommodate such correlations. For
continuous responses, the normal linear mixed effects model offers such a
flexible framework and has been well studied in the past. A recent reference is
Verbeke and Molenberghs (2000), who also discuss computer software for fitting
linear mixed effects models with popular statistical packages. Although the normal
linear mixed effects model is but one member of the broader class of generalized
linear mixed effects models, it enjoys unique properties which simplify parameter
estimation and interpretation substantially. For discrete response data, however,
the normal distribution is not appropriate, and other members of the exponential
family of distributions have to be considered.

1.1     Regression  Models  for  Correlated  Discrete  Data 

In this introduction we will review extensions of the basic generalized linear
model (McCullagh and Nelder, 1989) for analyzing independent observations to
models for correlated data. These models are marginal (Section 1.2), transitional
(Section 1.3) and random effects models (Section 1.4). An extensive discussion of
these models with respect to discrete longitudinal data is given in the books by
Agresti (2002) and Diggle, Heagerty, Liang and Zeger (2002). In general,
longitudinal studies concern only a few repeated measurements. In this dissertation,
however, we are interested in the analysis of much longer series of repeated
observations, often exceeding 100 repeated measurements. Therefore, the following review
focuses specifically on models for univariate time series observations, some of which
are presented in Fahrmeir and Tutz (2001).

Let $y_t$ be a response at time $t$, $t = 1, \ldots, T$, observed together with a vector
of covariates denoted by $x_t$. In a generalized linear model (GLM), the mean
$\mu_t = E[y_t]$ of observation $y_t$ depends on a linear predictor $\eta_t = x_t'\beta$ through a link
function $h(\cdot)$, forming the relationship $\mu_t = h^{-1}(x_t'\beta)$. The variance of $y_t$ depends
on the mean through the relationship $\mathrm{var}(y_t) = \phi_t v(\mu_t)$, where $v(\cdot)$ is a
distribution-specific variance function and $\{\phi_t\}$ are additional dispersion parameters. In a
regular GLM, observations at any two distinct time points $t$ and $t^*$ are assumed
independent.
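
As a small illustration of these GLM ingredients, the following sketch evaluates the inverse link and the variance relationship for a log link with $v(\mu) = \mu$ (Poisson) and a logit link with $v(\mu) = \mu(1 - \mu)$ (Bernoulli). The function names, covariate vector and coefficient values are hypothetical and chosen only for illustration.

```python
import math

def linear_predictor(x, beta):
    """eta_t = x_t' beta for one covariate vector x_t."""
    return sum(xj * bj for xj, bj in zip(x, beta))

def mean_log_link(x, beta):
    """Inverse log link (e.g. Poisson counts): mu_t = exp(eta_t)."""
    return math.exp(linear_predictor(x, beta))

def mean_logit_link(x, beta):
    """Inverse logit link (e.g. binary data): mu_t = 1 / (1 + exp(-eta_t))."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(x, beta)))

def glm_variance(mu, variance_function, phi=1.0):
    """var(y_t) = phi_t * v(mu_t)."""
    return phi * variance_function(mu)

# Hypothetical covariate vector (intercept, one covariate) and coefficients.
x_t, beta = [1.0, 0.5], [0.2, -0.4]
mu_count = mean_log_link(x_t, beta)      # Poisson-type mean
mu_binary = mean_logit_link(x_t, beta)   # Bernoulli-type mean
var_count = glm_variance(mu_count, lambda m: m)
var_binary = glm_variance(mu_binary, lambda m: m * (1.0 - m))
```

With a dispersion parameter of 1, the Poisson variance equals the mean, which is the baseline against which overdispersion is judged later in this chapter.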

In the models discussed below, the type of extension to accommodate
correlated data depends on the way the correlation is introduced into the model. In
marginal models, the correlation can be specified directly, e.g., $\mathrm{corr}(y_t, y_{t^*}) = \rho$, or
left completely unspecified, but nonetheless accounted for in likelihood based and
non-likelihood based inferences. In transitional models correlation is introduced
by including previous observations in the linear predictor, e.g., $\eta_t = \tilde{x}_t'\tilde{\beta}$, where
$\tilde{x}_t = (x_t', y_{t-1}, y_{t-2}, \ldots)'$ and $\tilde{\beta} = (\beta', \alpha_1, \alpha_2, \ldots)'$ are extensions of the design and
parameter vector of a GLM with independent components. Random effects models
induce correlation between observations by including random effects rather than
previous observations in the linear predictor, e.g., $\eta_t = x_t'\beta + u$, where $u$ is a
random effect shared by all observations.

The  way  correlation  is  built  into  a  model  also  determines  the  type  of  inference. 
Typically, marginal models are fitted by a quasi-likelihood approach, estimation in
transitional models is based on a conditional or partial likelihood, and inference
in  random  effects  models  relies  on  a  full  likelihood  (possibly  Bayesian)  approach. 
However,  models  and  inferential  procedures  have  been  developed  that  allow  more 
flexibility  than  the  above  categorization. 

1.2     Marginal  Models 

In marginal regression models, the main scientific goal is to assess the influence
of covariates on the marginal mean of $y_t$, treating the association structure between
repeated observations as a nuisance. The marginal mean $\mu_t$ and variance $\mathrm{var}(y_t)$
are modeled separately from a correlation structure between two observations
$y_t$ and $y_{t^*}$. Regression parameters in the linear predictor are called
population-averaged parameters, because their interpretation is based on an average over
all individuals in a specific covariate subgroup. Due to the correlation among
repeated observations, the likelihood for the model refers to the joint distribution
of all observations and not to the simpler product of their marginal distributions.
However, the model is specified in terms of these marginal distributions, which
makes maximum likelihood fitting particularly hard for even a moderate number $T$
of repeated measurements.
1.2.1      Likelihood  Based  Estimation  Methods 

For  binary  data,  Fitzmaurice  and  Laird  (1993)  discuss  a  parametrization  of  the 
joint  distribution  in  terms  of  conditional  probabilities  and  log  odds  ratios.  These 
parameters  are  related  to  the  marginal  mean  and  the  same  conditional  log  odds 
ratios,  which  describe  the  higher  order  associations  among  the  repeated  responses. 
The marginal mean and the higher order associations are then modelled in terms of
orthogonal parameters $\beta$ and $\alpha$, respectively. Fitzmaurice and Laird (1993) present
an  algorithm  for  maximizing  the  likelihood  with  respect  to  these  two  parameter 
sets.  The  algorithm  has  been  implemented  in  a  freely  available  computer  program 
(MAREG)  by  Kastner  et  al.  (1997). 


Another approach to maximum likelihood fitting for longitudinal discrete
data regards the marginal model as a constraint on the joint distribution and
maximizes the likelihood subject to this constraint. The model is written in terms
of a generalized log-linear model $C \log(A\mu) = X\beta$, where $\mu$ is a vector of expected
counts and $A$ and $C$ are matrices to form marginal counts and functions of those
marginal  counts,  respectively.  With  this  approach,  no  specific  assumption  about 
the  correlation  structure  of  repeated  observations  is  made,  and  the  likelihood 
refers  to  the  most  general  form  for  the  joint  distribution.  However,  simultaneous 
modeling  of  the  marginal  distribution  and  a  simplified  joint  distribution  is  also 
possible.  Details  can  be  found  in  Lang  and  Agresti  (1994)  and  Lang  (1996).  Lang 
(2004)  also  offers  an  R  computer  program  (mph.fit)  for  maximum  likelihood  fitting 
of  these  very  general  marginal  models. 
1.2.2     Quasi-Likelihood  Based  Estimation  Methods 

The  drawback  of  the  two  approaches  mentioned  above  and  likelihood  based 
methods  in  general  is  that  they  require  enormous  computing  resources  as  the 
number  of  repeated  responses  increases  or  the  number  of  covariates  is  large,  making 
maximum  likelihood  fitting  computationally  impossible  for  long  time  series.  This  is 
also  true  for  estimation  based  on  alternative  parameterizations  of  a  distribution  for 
multivariate  binary  data  such  as  those  discussed  in  Bahadur  (1961),  Cox  (1972)  or 
Zhao  and  Prentice  (1990). 

Estimating  methods  leading  to  computationally  simpler  inference  (albeit 
not  maximum  likelihood)  for  marginal  models  are  based  on  a  quasi-likelihood 
approach  (Wedderburn,  1974).  In  a  quasi-likelihood  approach,  no  specific  form 
for  the  distribution  of  the  responses  is  assumed  and  only  the  mean,  variance  and 
correlation  are  specified.  However,  with  discrete  data,  specifying  the  mean  and 
covariances  does  not  determine  the  likelihood,  as  it  would  with  normal  data,  so 
parameter estimation cannot be based on it. Liang and Zeger (1986) proposed
generalized estimating equations (GEE) to estimate parameters, which have
the form of score equations for GLMs, but cannot be interpreted as such. Their
approach  also  requires  the  specification  of  a  working  correlation  matrix  for  the 
repeated  responses.  They  show  that  if  the  mean  function  is  correctly  specified, 
the  solution  to  the  generalized  estimating  equations  is  a  consistent  estimator, 
regardless  of  the  assumed  variance-covariance  structure  for  the  repeated  responses. 
They  also  present  an  estimator  of  the  asymptotic  variance-covariance  matrix 
for  the  GEE  estimates,  which  is  robust  against  misspecification  of  the  working 
correlation  matrix.  Several  structured  working  correlation  matrices  have  been 
proposed  for  parsimonious  modeling  of  the  marginal  correlation,  and  some  of  them 
are  implemented  in  statistical  software  packages  for  GEE  estimation  (e.g.,  SAS's 
proc  genmod  with  the  repeated  statement  and  the  type  option  or  the  gee  and  geel 
packages  in  R). 
1.2.2.1     GEE  for  time  series  of  counts 

Zeger (1988) uses the GEE methodology to fit a marginal model to a time
series $\{y_t\}_{t=1}^{T}$ of $T = 168$ monthly counts of cases of poliomyelitis in the United
States. He specifies the marginal mean, variance and correlation by

$$
\begin{aligned}
\mu_t &= \exp(x_t'\beta), \\
\mathrm{var}(y_t) &= \mu_t + \sigma^2 \mu_t^2, \\
\mathrm{corr}(y_t, y_{t+\tau}) &= \frac{\rho(\tau)}{\left[\{1 + (\sigma^2 \mu_t)^{-1}\}\{1 + (\sigma^2 \mu_{t+\tau})^{-1}\}\right]^{1/2}},
\end{aligned}
\tag{1.1}
$$

where $\sigma^2$ is the variance and $\rho(\tau)$ the autocorrelation function of an underlying
random process $\{u_t\}$. To fit this marginal model, he proposes and outlines the
GEE approach, but notes that it requires inversion of the $T \times T$ variance-covariance
matrix of $y_1, \ldots, y_T$, which has no recognizable structure and therefore no simple
inverse. Subsequently, he suggests approximating this matrix by a simpler,
structured matrix, leading to nearly as efficient estimators as would have been obtained
with the GEE approach. The variance component $\sigma^2$ and unknown parameters in
$\rho(\tau)$ are estimated by a method of moments approach.

Interestingly, Zeger (1988) derives the marginal mean, variance and correlation
in (1.1) from a random effects model specification: Conditional on an underlying
latent random process $\{u_t\}$ with $E[u_t] = 1$ and $\mathrm{cov}(u_t, u_{t+\tau}) = \sigma^2 \rho(\tau)$, he initially
models the time series observations as conditionally independent Poisson variables
with mean and variance

$$ E[y_t \mid u_t] = \mathrm{var}(y_t \mid u_t) = \exp(x_t'\beta)\, u_t. \tag{1.2} $$

Marginally, by the law of iterated expectation, this leads to the moments
presented in (1.1). From there we also see that the latent random process $\{u_t\}$ has
introduced both overdispersion relative to a Poisson variable and autocorrelation
among the observations. The models we will develop in subsequent chapters have
similar features. The equation for the marginal correlation between $y_t$ and $y_{t+\tau}$
shows that the autocorrelation in the observed time series must be less than the
autocorrelation in the latent process $\{u_t\}$. We will return to the polio data set in
Chapter 5, where we compare this model to models suggested in this dissertation
and elsewhere.
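
The step from (1.2) to (1.1) can be checked numerically. Under conditional independence, the conditional covariance is zero, so $\mathrm{cov}(y_t, y_{t+\tau}) = \mu_t \mu_{t+\tau} \sigma^2 \rho(\tau)$ and $\mathrm{var}(y_t) = \mu_t + \sigma^2 \mu_t^2$. The sketch below computes the correlation both from these moments and from the closed form in (1.1); the function names and the numeric inputs are illustrative assumptions, not values from the polio data.

```python
import math

def marginal_moments(mu_t, mu_s, sigma2, rho_tau):
    """Marginal var(y_t) and corr(y_t, y_s) built from the moments implied by (1.2)."""
    var_t = mu_t + sigma2 * mu_t ** 2      # Poisson part + latent-process part
    var_s = mu_s + sigma2 * mu_s ** 2
    cov = sigma2 * rho_tau * mu_t * mu_s   # only the latent process contributes
    return var_t, cov / math.sqrt(var_t * var_s)

def corr_closed_form(mu_t, mu_s, sigma2, rho_tau):
    """Correlation written as in (1.1)."""
    shrink = (1.0 + 1.0 / (sigma2 * mu_t)) * (1.0 + 1.0 / (sigma2 * mu_s))
    return rho_tau / math.sqrt(shrink)

# Hypothetical means, latent variance and latent autocorrelation.
var_t, corr_a = marginal_moments(2.0, 3.0, 0.5, 0.8)
corr_b = corr_closed_form(2.0, 3.0, 0.5, 0.8)
```

Both routes give the same value, and it is strictly smaller than the latent autocorrelation of 0.8, in line with the attenuation noted above.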
1.2.2.2     GEE  for  binomial  time  series 

For binary and binomial time series data, it is often more advantageous to
model the association between observations using the odds ratio rather than
directly specifying the marginal correlation $\mathrm{corr}(y_t, y_{t^*})$ as with count data.
The odds ratio is a more natural metric to measure association between binary
outcomes and easier to interpret. The correlation between two binary outcomes
$Y_1$ and $Y_2$ is also constrained in a complicated way by their marginal means
$\mu_1 = P(Y_1 = 1)$ and $\mu_2 = P(Y_2 = 1)$ as a consequence of the following inequalities
for their joint distribution:

$$ P(Y_1 = 1, Y_2 = 1) = \mu_1 + \mu_2 - P(Y_1 = 1 \text{ or } Y_2 = 1) \ge \max\{0, \mu_1 + \mu_2 - 1\} $$

and

$$ P(Y_1 = 1, Y_2 = 1) \le \min\{\mu_1, \mu_2\}, $$

leading to

$$ \max\{0, \mu_1 + \mu_2 - 1\} \le P(Y_1 = 1, Y_2 = 1) \le \min\{\mu_1, \mu_2\}. $$
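
To see how restrictive these bounds can be, the following sketch converts the bounds on the joint probability into bounds on the correlation, using $\mathrm{corr}(Y_1, Y_2) = \{P(Y_1 = 1, Y_2 = 1) - \mu_1\mu_2\} / \{\mu_1(1-\mu_1)\mu_2(1-\mu_2)\}^{1/2}$. The function name and the example means are illustrative assumptions.

```python
import math

def binary_corr_bounds(mu1, mu2):
    """Smallest and largest attainable correlation for two binary variables
    with success probabilities mu1 and mu2 (both strictly between 0 and 1)."""
    joint_lo = max(0.0, mu1 + mu2 - 1.0)   # lower bound on P(Y1=1, Y2=1)
    joint_hi = min(mu1, mu2)               # upper bound on P(Y1=1, Y2=1)
    sd = math.sqrt(mu1 * (1.0 - mu1) * mu2 * (1.0 - mu2))
    return (joint_lo - mu1 * mu2) / sd, (joint_hi - mu1 * mu2) / sd

# Equal means of 0.5 allow the full range; unequal means do not.
full_range = binary_corr_bounds(0.5, 0.5)
restricted = binary_corr_bounds(0.2, 0.7)
```

Only when the two means coincide can the correlation reach 1, which is one reason an odds ratio parametrization is more convenient for binary series with time-varying means.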

Therefore, instead of marginal correlations, a number of authors (Fitzmaurice,
Laird and Rotnitzky, 1993; Carey, Zeger and Diggle, 1993) propose the use of
marginal odds ratios. For unequally spaced and unbalanced binary time series data,
Fitzmaurice and Lipsitz (1995) present a GEE approach which models the marginal
association using serial odds ratio patterns. Let $\psi_{tt^*}$ denote the marginal odds ratio
between two binary observations $y_t$ and $y_{t^*}$. Their model for the association has the
form

$$ \psi_{tt^*} = \alpha^{1/|t - t^*|}, \quad 1 < \alpha < \infty, $$

which has the property that as $|t - t^*| \to 0$, there is perfect association ($\psi_{tt^*} \to \infty$),
and as $|t - t^*| \to \infty$, the observations are independent ($\psi_{tt^*} \to 1$). Note, however,
that only positive association is possible with this type of model. (SAS's proc
genmod now offers the possibility of specifying a general regression structure for the
log odds ratios with the logor option.)
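
The serial odds ratio pattern is easy to evaluate for unequally spaced observation times. The sketch below is only an illustration of the formula; the function name, the value of $\alpha$ and the observation times are assumptions.

```python
def serial_odds_ratio(t, t_star, alpha):
    """Marginal odds ratio alpha ** (1 / |t - t*|) for two distinct time points."""
    if t == t_star:
        raise ValueError("time points must be distinct")
    if alpha <= 1.0:
        raise ValueError("alpha must exceed 1 (positive association only)")
    return alpha ** (1.0 / abs(t - t_star))

# Hypothetical unequally spaced observation times and association parameter.
times = [1, 2, 5, 9, 30]
odds_ratios = [serial_odds_ratio(times[0], s, 4.0) for s in times[1:]]
```

The resulting odds ratios decrease toward 1 as the time separation grows, so distant observations are treated as nearly independent.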

1.3     Transitional  Models 
In transitional models, past observations are simply treated as additional
predictors. Interest lies in estimating the effects of these and other explanatory
variables on the conditional mean of the response $y_t$, given realizations of the past
responses. Specifying the relationship between the mean of $y_t$ and previous
observations $y_{t-1}, y_{t-2}, \ldots$ is another way (in contrast to the direct way of marginal
models) of modeling the dependency between correlated responses. Transitional
models fit into the framework of GLMs, where, however, the distribution of $y_t$ is
now conditional on the past responses. The model in its most general form (Diggle
et al., 2002) expresses the conditional mean of $y_t$ as a function of explanatory
variables and $q$ functions $f_r(\cdot)$ of past responses,

$$ E[y_t \mid H_t] = h^{-1}\left( x_t'\beta + \sum_{r=1}^{q} f_r(H_t; \alpha) \right), \tag{1.3} $$

where $H_t = \{y_{t-1}, y_{t-2}, \ldots, y_1\}$ denotes the collection of past responses. $H_t$ can
also include past explanatory variables and parameters. Often, the models are in
discrete-time Markov chain form of order $q$, and the conditional distribution of
$y_t$ given $H_t$ only depends on the last $q$ responses $y_{t-1}, \ldots, y_{t-q}$. For example, a
transitional logistic regression model for binary responses that is a second order
Markov chain has form

$$ \mathrm{logit}\, P(Y_t = 1 \mid y_{t-1}, y_{t-2}) = x_t'\beta + \alpha_1 y_{t-1} + \alpha_2 y_{t-2}. $$

The main difference between transitional models and regular GLMs or marginal
models is parameter interpretation. Both the interpretation of $\alpha$ and the
interpretation of $\beta$ are conditional on previous outcomes and depend on how many
of these are included. As the time dependence in the model changes, so does the
interpretation of parameters. With the logistic regression example from above, the
conditional odds of success at time $t$ are $\exp(\alpha_1)$ times as large if the previous
response was a success rather than a failure. However, this interpretation assumes
a fixed and given outcome at time $t - 2$. Similarly, a coefficient in $\beta$ represents
the change in the log odds for a unit change in $x_t$, conditional on the two prior
responses. It might be possible that we lose information on the covariate effect by
conditioning on these previous outcomes. In general, the interpretation of
parameters in transitional models is different from the population-averaged interpretation
we discussed for marginal models, where parameters are effects on the marginal
mean without conditioning on any previous outcomes.
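
The conditional interpretation can be made concrete with a short calculation for the second order logistic example. The sketch below evaluates the conditional success probability for given past responses and confirms that, with the response at time $t - 2$ held fixed, the odds ratio comparing a previous success with a previous failure equals $\exp(\alpha_1)$. All parameter values and function names are hypothetical.

```python
import math

def conditional_prob(x, beta, alpha1, alpha2, y_prev1, y_prev2):
    """P(Y_t = 1 | y_{t-1}, y_{t-2}) in the second order transitional logit model."""
    eta = sum(xj * bj for xj, bj in zip(x, beta))
    eta += alpha1 * y_prev1 + alpha2 * y_prev2
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    return p / (1.0 - p)

# Hypothetical covariates and parameters.
x_t, beta, alpha1, alpha2 = [1.0, 0.3], [-0.5, 1.2], 0.9, 0.4

# Hold y_{t-2} fixed at 1 and vary y_{t-1}.
p_after_success = conditional_prob(x_t, beta, alpha1, alpha2, 1, 1)
p_after_failure = conditional_prob(x_t, beta, alpha1, alpha2, 0, 1)
odds_ratio = odds(p_after_success) / odds(p_after_failure)
```

The same odds ratio results when the response at time $t - 2$ is fixed at 0 instead, but the two conditional probabilities themselves change, which is the sense in which the interpretation depends on the conditioning history.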
1.3.1     Model  Fitting 

If a discrete-time Markov model applies, the likelihood for a generic series
$y_1, \ldots, y_T$ is determined by the Markov chain structure:

$$ L(\beta, \alpha; y_1, \ldots, y_T) = f(y_1, \ldots, y_q) \prod_{t=q+1}^{T} f(y_t \mid y_{t-1}, \ldots, y_{t-q}). $$

However, the transitional model (1.3) only specifies the conditional distributions
appearing in the product, but not the first term of the likelihood. Often, instead
of a full maximum likelihood approach, one conditions on the first $q$ observations
and maximizes the corresponding conditional likelihood. If in addition $f_r(H_t; \alpha)$
in (1.3) is a linear function in $\alpha$ (and possibly $\beta$), then maximization follows
along the lines of GLMs for independent data. Kaufmann (1987) establishes the
asymptotic properties such as consistency, asymptotic normality and efficiency of
the conditional maximum likelihood estimator.
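
A minimal sketch of the conditional likelihood idea for a binary series, assuming a logit link and $f_r(H_t; \alpha) = \alpha_r y_{t-r}$: the first $q$ observations are conditioned on and contribute no term, and each remaining observation contributes a Bernoulli log density. The function name and data are hypothetical, and no maximization is attempted here.

```python
import math

def conditional_loglik(y, X, beta, alpha):
    """Conditional log-likelihood of a Markov chain of order q = len(alpha),
    given the first q observations; y is a 0/1 list, X a list of covariate vectors."""
    q = len(alpha)
    loglik = 0.0
    for t in range(q, len(y)):
        eta = sum(xj * bj for xj, bj in zip(X[t], beta))
        eta += sum(alpha[r] * y[t - 1 - r] for r in range(q))
        # Bernoulli log density with logit link: y * eta - log(1 + exp(eta))
        loglik += y[t] * eta - math.log1p(math.exp(eta))
    return loglik

# Hypothetical short series with an intercept-only design.
y = [0, 1, 1, 0, 1, 0]
X = [[1.0]] * len(y)
ll_null = conditional_loglik(y, X, beta=[0.0], alpha=[0.0, 0.0])
ll_dep = conditional_loglik(y, X, beta=[0.0], alpha=[-1.0, 0.0])
```

Because the objective has the same form as a logistic regression log-likelihood with lagged responses as extra columns, standard GLM software can maximize it, as the text notes.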

If a Markov assumption is not warranted, estimation can be based on the
partial likelihood (Cox, 1975). To motivate the partial likelihood approach, we
follow Kedem and Fokianos (2002): They consider occasions where a time series
$\{Y_t\}$ is observed jointly with a random covariate series $\{X_t\}$. The joint density of
$(X_t, Y_t)$, $t = 1, \ldots, T$, parameterized by a vector $\theta$, can be expressed as

$$ f(x_1, y_1, \ldots, x_T, y_T; \theta) = f(x_1; \theta) \left[ \prod_{t=2}^{T} f(x_t \mid H_t; \theta) \right] \left[ \prod_{t=1}^{T} f(y_t \mid \tilde{H}_t; \theta) \right], \tag{1.4} $$

where $H_t = (x_1, y_1, \ldots, x_{t-1}, y_{t-1})$ and $\tilde{H}_t = (x_1, y_1, \ldots, x_{t-1}, y_{t-1}, x_t)$ hold the
history up to time points $t - 1$ and $t$, respectively. Let $\mathcal{F}_{t-1}$ denote the $\sigma$-field
generated by $Y_{t-1}, Y_{t-2}, \ldots, X_t, X_{t-1}, \ldots$, i.e., $\mathcal{F}_{t-1}$ is generated by past responses
and present and past values of the covariates. Also, let $f_t(y_t \mid \mathcal{F}_{t-1}; \theta)$ denote the
conditional density of $Y_t$, given $\mathcal{F}_{t-1}$, which is of exponential family form with
mean modeled by (1.3). Then, the partial likelihood for $\theta = (\alpha, \beta)$ is given by

$$ PL(\theta; y_1, \ldots, y_T) = \prod_{t=1}^{T} f_t(y_t \mid \mathcal{F}_{t-1}; \theta), \tag{1.5} $$

which is the second product in (1.4) and hence the term partial. The loss of
information by ignoring the first product in the joint density is considered small.
If the covariate process is deterministic, then the partial likelihood becomes a
conditional likelihood, but without the necessity of a Markov assumption on the
distribution of the $Y_t$'s.

Standard asymptotic results from likelihood analysis of independent data carry
over to the case of partial likelihood estimation with dependent data. Fokianos and
Kedem (1998) showed consistency and asymptotic normality of the maximum partial
likelihood estimator $\hat{\theta}$ and provided an
expression for the asymptotic covariance matrix. Since the score equation obtained
from (1.5) is identical to one for independent data in a GLM, partial likelihood
holds the advantage of easy, fast and readily available software implementation
with standard estimation routines such as iteratively reweighted least squares.
1.3.2     Transitional  Models  for  Time  Series  of  Counts 

For a time series of counts $\{y_t\}$, Zeger and Qaqish (1988) propose Markov-type
transitional models which they fit using quasi-likelihood methods and the
estimating equations approach. They consider various models for the conditional
mean $\mu_t = E[y_t \mid H_t]$ of form $\log(\mu_t) = x_t'\beta + \sum_{r=1}^{q} \alpha_r f_r(H_{t-r})$, where for example
$f_r(H_{t-r}) = y_{t-r}$ or $f_r(H_{t-r}) = \log(y_{t-r} + c) - \log(\exp[x_{t-r}'\beta] + c)$. One common goal
of their models is to approximate the marginal mean by $E[y_t] = E[\mu_t] \approx \exp\{x_t'\beta\}$
so that $\beta$ has an approximate marginal interpretation as the change in the log
mean for a unit change in the explanatory variables. Davis et al. (2003) develop
these models further and propose $f_r(H_{t-r}) = (y_{t-r} - \mu_{t-r})/\mu_{t-r}^{\lambda}$ as a more
appropriate function to build serial dependence into the model, where $\lambda$ is an
additional parameter. They explore stability properties such as stationarity and
ergodicity of these models and describe fast (in comparison to maximum likelihood
techniques required for competing random effects models), recursive and iterative
maximum likelihood estimation algorithms.
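
To make the structure of such conditional mean models concrete, the sketch below evaluates $\mu_t$ for the choice $f_r(H_{t-r}) = \log(y_{t-r} + c) - \log(\exp[x_{t-r}'\beta] + c)$. The function name, the constant $c = 0.5$ and the data are assumptions for illustration; the sketch only evaluates the mean recursion and does not perform estimation.

```python
import math

def conditional_means(y, X, beta, alpha, c=0.5):
    """mu_t = E[y_t | H_t] for t = q, ..., T-1 (0-based), with
    log(mu_t) = x_t' beta + sum_r alpha_r * f_r and
    f_r = log(y_{t-r} + c) - log(exp(x_{t-r}' beta) + c)."""
    q = len(alpha)
    lin = [sum(xj * bj for xj, bj in zip(x, beta)) for x in X]
    means = []
    for t in range(q, len(y)):
        eta = lin[t]
        for r in range(1, q + 1):
            lagged = math.log(y[t - r] + c) - math.log(math.exp(lin[t - r]) + c)
            eta += alpha[r - 1] * lagged
        means.append(math.exp(eta))
    return means

# Hypothetical series: the design gives exp(x_t' beta) = 1 at every time point.
X = [[0.0]] * 4
on_target = conditional_means([1, 1, 1, 1], X, beta=[1.0], alpha=[0.7])
after_spike = conditional_means([3, 1, 1, 1], X, beta=[1.0], alpha=[0.7])
```

When the lagged count equals its fitted value, the dependence term vanishes and the conditional mean reduces to $\exp(x_t'\beta)$; a count above that value raises the next conditional mean when $\alpha_1 > 0$.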

Chapter  4  in  Kedem  and  Fokianos  (2002)  discusses  regression  models  of  form 
(1.3)  assuming  a  conditional  Poisson  or  double-truncated  Poisson  distribution 
for  the  counts,  with  inference  based  on  the  partial  likelihood  concept.  Their 
methodology  is  illustrated  with  two  examples  about  monthly  counts  of  rainy  days 
and  counts  of  tourist  arrivals. 
1.3.3     Transitional  Models  for  Binary  Data 

For binary data $\{y_t\}$, a two-state, first-order Markov chain can be defined by
its probability transition matrix

$$ P = \begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix}, $$

where $p_{ab} = P(Y_t = b \mid Y_{t-1} = a)$, $a, b = 0, 1$, are the one-step transition probabilities
between the two states $a$ and $b$. Diggle et al. (2002, Chapt. 10.3) discuss various
logistic regression models for these probabilities and higher order Markov chains for
equally spaced observations. Unequally spaced data cannot be routinely handled
with these models.

How can we determine the marginal association structure implied by the
conditionally specified model? Let $p^1 = (p_0^1, p_1^1)$ be the initial marginal distribution
for the states at time $t = 1$. Then the distribution of the states at time $n$ is
given by $p^n = p^1 P^{n-1}$. As $n$ increases, $p^n$ approaches a steady state or equilibrium
distribution that satisfies $p = pP$. The solution to this equation is given by
$p_1 = P(Y_t = 1) = E[y_t] = p_{01}/(p_{01} + p_{10})$ and is used to derive marginal moments
implied by the transitional model. For example, it can be shown (Kedem, 1980)
that in the steady state, the marginal variance and correlation implied by the
transitional model are $\mathrm{var}(y_t) = p_0 p_1$ (as it should be) and $\mathrm{corr}(y_{t-1}, y_t) = p_{11} - p_{01}$,
respectively.
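
These steady state results can be verified numerically for any particular transition matrix. The sketch below propagates an initial distribution forward, compares the limit with $p_{01}/(p_{01} + p_{10})$, and recomputes the lag-one correlation from the stationary joint distribution. The transition probabilities and function names are hypothetical.

```python
def stationary_p1(p01, p10):
    """Steady state probability P(Y_t = 1) of a two-state chain."""
    return p01 / (p01 + p10)

def propagate(dist, P, steps):
    """Apply the transition matrix `steps` times to the row vector `dist`."""
    p0, p1 = dist
    for _ in range(steps):
        p0, p1 = p0 * P[0][0] + p1 * P[1][0], p0 * P[0][1] + p1 * P[1][1]
    return p0, p1

def lag1_corr_from_joint(P):
    """corr(y_{t-1}, y_t) in the steady state, computed from first principles."""
    p1 = stationary_p1(P[0][1], P[1][0])
    joint11 = p1 * P[1][1]                      # P(Y_{t-1} = 1, Y_t = 1)
    return (joint11 - p1 * p1) / (p1 * (1.0 - p1))

# Hypothetical transition matrix with rows (p00, p01) and (p10, p11).
P = [[0.8, 0.2], [0.4, 0.6]]
limit = propagate((1.0, 0.0), P, 200)
corr_direct = lag1_corr_from_joint(P)
corr_formula = P[1][1] - P[0][1]
```

For this matrix the chain settles at $P(Y_t = 1) = 1/3$ regardless of the starting distribution, and both ways of computing the lag-one correlation give 0.4.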

Azzalini (1994) models serial dependence in binary data through transition
models, but at the same time retains the marginal interpretation of regression
parameters. He specifies the marginal regression model $\mathrm{logit}(\mu_t) = x_t'\beta$ for a
binary time series $\{y_t\}$ with $E[y_t] = \mu_t$, but assumes that a binary Markov chain
with transition probabilities $p_{ab}$ has generated the data. Therefore, the likelihood
refers to these probabilities but the model specifies marginal probabilities, a
complication similar to the fitting of marginal models discussed in the previous
section. However, assuming a constant log odds ratio

$$ \theta = \log\left( \frac{P(Y_{t-1} = 1, Y_t = 1)\, P(Y_{t-1} = 0, Y_t = 0)}{P(Y_{t-1} = 0, Y_t = 1)\, P(Y_{t-1} = 1, Y_t = 0)} \right) $$

between any two adjacent observations, Azzalini (1994) shows how to write
$p_{ab}$ in terms of just this log odds ratio $\theta$ and the marginal probabilities $\mu_t$ and
$\mu_{t-1}$. Maximum likelihood estimation for such models is tedious but possible in
closed form, although second derivatives of the log likelihood function have to be
calculated numerically. A software package (the S-plus function rm.tools, Azzalini
and Chiogna, 1997) exists to fit such models for binary and Poisson observations.
Azzalini  (1994)  mentions  that  this  basic  approach  can  be  extended  to  include 
variable  odds  ratios  between  any  two  adjacent  observations,  possibly  depending  on 
covariates,  but  this  is  not  pursued  in  the  article.  Diggle  et  al.  (2002)  discuss  these 
marginalized  transitional  models  further. 
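
One way to see that two marginal probabilities and an odds ratio determine the transition probabilities is the standard quadratic for the joint probability of a 2x2 table with fixed margins. The sketch below uses that generic construction; it is an assumption-laden illustration and not necessarily the expression given by Azzalini (1994). Function and argument names are hypothetical.

```python
import math

def joint_p11(mu_prev, mu_curr, log_or):
    """P(Y_{t-1} = 1, Y_t = 1) given both margins and the log odds ratio."""
    psi = math.exp(log_or)
    if abs(psi - 1.0) < 1e-12:
        return mu_prev * mu_curr               # independence
    s = 1.0 + (psi - 1.0) * (mu_prev + mu_curr)
    disc = s * s - 4.0 * psi * (psi - 1.0) * mu_prev * mu_curr
    return (s - math.sqrt(disc)) / (2.0 * (psi - 1.0))

def transition_probs(mu_prev, mu_curr, log_or):
    """Return (P(Y_t = 1 | Y_{t-1} = 0), P(Y_t = 1 | Y_{t-1} = 1))."""
    p11 = joint_p11(mu_prev, mu_curr, log_or)
    return (mu_curr - p11) / (1.0 - mu_prev), p11 / mu_prev

# Hypothetical margins and an odds ratio of 3 between adjacent observations.
mu_prev, mu_curr = 0.3, 0.4
p1_given_0, p1_given_1 = transition_probs(mu_prev, mu_curr, math.log(3.0))

# Check: the margin at time t and the odds ratio are both recovered.
margin_back = (1.0 - mu_prev) * p1_given_0 + mu_prev * p1_given_1
or_back = (p1_given_1 / (1.0 - p1_given_1)) / (p1_given_0 / (1.0 - p1_given_0))
```

The check at the end confirms that the construction preserves the marginal mean at time $t$, which is what lets the regression parameters keep their marginal interpretation while the likelihood is written through transition probabilities.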

Chapter  2  in  Kedem  and  Fokianos  (2002)  presents  a  detailed  discussion  of 
partial  likelihood  estimation  for  transitional  binary  models  and  discusses,  among 
other  examples,  the  eruption  data  of  the  Old  Faithful  geyser  which  we  will  turn  to 
in Chapter 5.

1.4     Random  Effects  Models

A  popular  way  of  modeling  correlation  among  dependent  observations  is  to 
include  random  effects  u  in  the  linear  predictor.  One  of  the  first  developments  for 
discrete  data  occurred  for  longitudinal  binary  data,  where  subject-specific  random 
effects  induced  correlation  between  repeated  binary  measurements  on  a  subject 
(Bock  and  Aitkin,  1981;  Stiratelli,  Laird  and  Ware,  1984).  In  general,  we  assume 
that unmeasurable factors give rise to the dependency in the data $\{y_t\}$ and random
effects $\{u_t\}$ represent the heterogeneity due to these unmeasured factors. Given
these  effects,  the  responses  are  assumed  independent.  However,  no  values  for  these 
factors  are  observed,  and  so  marginally  (i.e.,  averaged  over  these  factors),  the 
responses  are  dependent. 

Conditional on some random effects, we consider models that fit into the
framework of GLMs for independent data, i.e., where the conditional distribution
of $y_t \mid u_t$ is a member of the exponential family of distributions, whose mean
$E[y_t \mid u_t]$ is modeled as a function of a linear predictor $\eta_t = x_t'\beta + z_t'u_t$. Together
with a distributional assumption for the random effects (usually independent and
identically normal), this leads to generalized linear mixed models (GLMMs), where
the term mixed refers to the mixture of fixed and random effects in the linear
predictor. Chapter 2 contains a detailed definition of GLMMs and discusses maximum
likelihood fitting and parameter interpretation, and in Chapter 3 correlated random
effects for the description of time dependent observations $\{y_t\}$ are motivated and
described. Here, we only give a short literature review about GLMMs which use
correlated random effects to model time (or space) dependent data.
1.4.1      Correlated  Random  Effects  in  GLMMs 

One  of  the  first  papers  considering  correlated  random  effects  in  GLMMs  for 
the  description  of  (spatial)  dependence  in  Poisson  data  is  Breslow  and  Clayton 
(1993), who analyze lip cancer rates in Scottish counties. They propose correlated
normal random effects to capture the correlation in counts of adjacent districts in
Scotland.  A  random  effect  is  assigned  to  each  district,  and  two  random  effects  are 
correlated  if  their  districts  are  adjacent  to  each  other. 

In Section 1.2.2 we mentioned the polio data set, a time series of equally
spaced counts $\{y_t\}_{t=1}^{168}$, and formulated the conditional model (1.2) with a latent
process for the random effects. Instead of obtaining marginal moments as in Zeger
(1988),  Chan  and  Ledolter  (1995)  use  a  GLMM  approach  with  Poisson  random 
components  and  autoregressive  random  effects  to  analyze  the  time  series.  They 
outline  parameter  estimation  via  an  MCEM  algorithm  similar  to  the  one  discussed 
in  Sections  2.4  and  3.2  in  this  dissertation. 

One  of  the  three  central  generalized  linear  models  advocated  by  Diggle  et  al. 
(2002,  Chap.  11.2)  to  model  longitudinal  data  uses  correlated  random  effects.  For 
equally  spaced  binary  longitudinal  data  {yu},  they  plot  response  profiles  simulated 
according  to  the  model 

iogit[p(yit  =  1 1  Uit)]  =  ^  +  uu 

with  (7^  =  2.5^  and  p  =  0.9  and  note  that  the  profiles  exhibit  more  alternating 
runs  of  O's  and  I's  than  a  random  intercept  model  with  ua  =  Ui2  =  ■  ■  ■  = 
UiT  =  Ui.  However,  based  on  the  similarity  between  plots  of  random  intercepts, 
random  intercepts  and  slopes  and  autoregressive  random  effects  models,  they 
mention  the  challenge  that  binary  data  present  in  distinguishing  and  modeling  the 
underlying  dependency  structure  in  longitudinal  data.  (They  used  T  =  25  repeated 
observation  for  their  simulations.)  Furthermore,  they  state  that  numerical  methods 
for  maximum  likelihood  estimation  are  computationally  impractical  for  fitting 
models  with  higher  dimensional  random  effects.  This  makes  it  impossible,  they 
conclude,  to  fit  the  GLMM  with  serially  correlated  random  effects  using  maximum 
likelihood. Instead, they propose a Bayesian analysis using powerful Monte Carlo
Markov  chain  methods. 

Indeed,  the  majority  of  examples  in  the  literature  which  consider  correlated 
random  effects  in  a  GLMM  framework  take  a  Bayesian  approach.  Sun,  Speckman 
and Tsutakawa (2000) explore several types of correlated random effects (autoregressive,
generalized autoregressive and conditional autoregressive) in a Bayesian
analysis of a GLMM. As in any Bayesian analysis, the propriety of the posterior
distribution given the data is of concern when fixed effects and variance components
have improper prior distributions and random effects are (possibly singular)
multivariate normal. One of their results applied to Poisson or binomial data $\{y_t\}$
states that the posterior might be improper when $y_t = 0$ in the Poisson case and
cannot be proper when $y_t = 0$ or $y_t = n_t$ in the binomial case for any $t$ when
improper  or  non-informative  priors  are  used. 

Diggle, Tawn and Moyeed (1998) consider Gaussian spatial processes $S(x)$
to model spatial count data at locations $x$. The role of $S(x)$ is to explain any
residual  spatial  variation  after  accounting  for  all  known  explanatory  variables. 
They  also  use  a  Bayesian  framework  to  estimate  parameters  and  give  a  solution  to 
the  problem  of  predicting  the  count  at  a  new  location  x.  Ghosh  et  al.  (1998)  use 
correlated  random  effects  in  Bayesian  models  for  small  area  estimation  problems. 
They  present  an  application  of  pairwise  difference  priors  for  random  effects 
to  model  a  series  of  spatially  correlated  binomial  observations  in  a  Bayesian 
framework.  Zhang  (2002)  discusses  maximum  likelihood  estimation  with  an 
underlying  spatial  Gaussian  process  for  spatially  correlated  binomial  observations. 

Bayesian  models  for  binary  time  series  are  described  in  Liu  (2001),  based 
on  probit-type  models  for  correlated  binary  data  which  are  discussed  in  Chib 
and  Greenberg  (1998).  Probit  type  models  are  motivated  by  assuming  latent 
random variables $z = (z_1, \ldots, z_T)'$, which follow a $N(\mu, \Sigma)$ distribution with
$\mu = (\mu_1, \ldots, \mu_T)'$, $\mu_t = x_t'\beta$ and $\Sigma$ a correlation matrix. The $y_t$'s are assumed to be
generated according to

$$y_t = I(z_t > 0),$$

where $I(\cdot)$ is the indicator function. This leads to the (marginal) probit model
$P(Y_t = 1 \mid \beta, \Sigma) = P(z_t > 0 \mid \mu_t, \Sigma) = \Phi(\mu_t)$. Rich classes of dependency structures between
binary outcomes can be modeled through $\Sigma$. These models can further be extended
to include random effects through $\mu_t = x_t'\beta + z_t'u_t$ or $q$ previous responses such
as $\mu_t = x_t'\beta + \sum_{r=1}^{q} \alpha_r y_{t-r}$. It is important to note that $\Sigma$ has to be in correlation
form. To see this, suppose it is not and let $S = D \Sigma D$ be a covariance matrix
for the latent random variables $z$, where $D$ is a diagonal matrix holding standard
deviation parameters. The joint density of the time series under the multivariate
probit model is given by

$$P[(Y_1, \ldots, Y_T) = (y_1, \ldots, y_T)] = P[z \in A] = P[D^{-1}z \in A],$$

where $A = A_1 \times \cdots \times A_T$ with $A_t = (-\infty, 0]$ if $y_t = 0$ and $A_t = (0, \infty)$ if $y_t = 1$
are the intervals corresponding to the relationship $y_t = I(z_t > 0)$, for $t = 1, \ldots, T$.
However, the above relationship is true for any parametrization of $D$, because the
intervals $A_t$ are not affected by the transformation from $z$ to $D^{-1}z$. Hence, the
elements of $D$ are not identifiable based on the joint distribution of the observed
time series $y$.
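
The identifiability argument can be checked numerically. The following is a minimal sketch, assuming an AR(1)-type correlation matrix, latent means and scale values that are chosen purely for illustration (none of them come from the text): rescaling each latent coordinate by a positive standard deviation leaves every indicator $I(z_t > 0)$, and hence the observed binary series, unchanged.

```python
import numpy as np

def probit_outcomes(z):
    """Apply y_t = I(z_t > 0) elementwise to latent values z."""
    return (np.asarray(z) > 0).astype(int)

rng = np.random.default_rng(0)
T = 6
rho = 0.7  # hypothetical correlation parameter, for illustration only
lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
Sigma = rho ** lags                      # correlation matrix of the latent z
mu = np.linspace(-0.5, 0.5, T)           # hypothetical latent means
z = rng.multivariate_normal(mu, Sigma, size=1000)

d = np.array([0.1, 0.5, 1.0, 2.0, 5.0, 10.0])   # arbitrary positive scales
z_rescaled = z * d                       # rows now have covariance D Sigma D

y = probit_outcomes(z)
y_rescaled = probit_outcomes(z_rescaled)
print(np.array_equal(y, y_rescaled))     # the two binary data sets coincide
```

Because the binary data are identical under both scalings, no amount of data can distinguish between them, which is why the latent covariance is restricted to correlation form.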

Lee  and  Nelder  (2001)  present  models  to  analyze  spatially  correlated  Poisson 
counts  and  binomial  longitudinal  data  about  cancer  mortality  rates.  They  explore 
a  variety  of  patterned  correlation  structures  for  random  effects  in  a  GLMM  setup. 
Model  fitting  is  based  on  the  joint  data  likelihood  of  observations  and  unobserved 
random  effects  (Lee  and  Nelder,  1996)  and  not  on  the  marginal  likelihood  of  the 
observed data. Model diagnostic plots of estimated random effects are presented to
aid in selecting an appropriate correlation structure.

1.4.2     Other Modeling Approaches

In  hidden  Markov  models  (MacDonald  and  Zucchini,  1997)  the  underlying 
random  process  is  assumed  to  be  a  discrete  state-space  Markov  chain  instead 
of  a  continuous  (normal)  process.  Probability  transition  matrices  describe  the 
connection  between  states.  A  very  convenient  property  of  hidden  Markov  models 
is  that  the  likelihood  can  be  evaluated  sufficiently  fast  to  permit  direct  numerical 
maximization.  MacDonald  and  Zucchini  (1997)  present  a  detailed  description  of 
hidden  Markov  models  for  the  analysis  of  binary  and  count  time  series. 
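
To illustrate why the hidden Markov likelihood is cheap to evaluate, here is a minimal sketch of the scaled forward recursion for a Poisson hidden Markov model. The function names and the parameter values in the usage note are hypothetical and not taken from MacDonald and Zucchini (1997); the cost is of order $T m^2$ for $T$ observations and $m$ states.

```python
import math
import numpy as np

def poisson_pmf(y, rate):
    """Poisson probability P(Y = y) for a scalar count y and rate(s)."""
    return np.exp(y * np.log(rate) - rate - math.lgamma(y + 1))

def poisson_hmm_loglik(y, init, trans, rates):
    """Log-likelihood of counts y under a Poisson hidden Markov model.

    init  : initial state probabilities, shape (m,)
    trans : transition matrix, trans[a, b] = P(state b at t+1 | state a at t)
    rates : Poisson mean for each hidden state, shape (m,)
    """
    init = np.asarray(init, dtype=float)
    trans = np.asarray(trans, dtype=float)
    rates = np.asarray(rates, dtype=float)
    # emis[t, k] = P(y_t | state k)
    emis = np.array([poisson_pmf(yt, rates) for yt in y])
    alpha = init * emis[0]
    loglik = 0.0
    for t in range(len(y)):
        if t > 0:
            alpha = (alpha @ trans) * emis[t]   # forward step
        c = alpha.sum()                         # P(y_t | y_1, ..., y_{t-1})
        loglik += np.log(c)
        alpha = alpha / c                       # rescale to avoid underflow
    return loglik
```

For example, `poisson_hmm_loglik([0, 3, 1, 5], [0.6, 0.4], [[0.9, 0.1], [0.2, 0.8]], [1.0, 4.0])` evaluates the likelihood of four counts under a two-state chain; such a function could be handed to a general-purpose optimizer.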

A  connection  between  transitional  models  and  random  effects  models  is 
explored  in  Aitkin  and  Alfo  (1998).  They  model  the  success  probabilities  of 
serial  binary  observations  conditional  on  subject-specific  random  effects  and  on 
the previous outcome. As in the models before, transition probabilities $P_{ab}$ are
changing  over  time  due  to  the  inclusion  of  time-dependent  covariates  and  the 
previous  observation  in  the  linear  predictor.  Additionally,  random  effects  account 
for  possibly  unobserved  sources  of  heterogeneity  between  subjects.  The  authors 
argue  that  the  conditional  model  specification  together  with  the  specification  of 
the  random  effects  distribution  does  not  determine  the  distribution  of  the  initial 
observation,  and  hence  the  likelihood  for  this  model  is  unspecified.  They  present 
a  solution  by  maximizing  the  likelihood  obtained  from  conditioning  on  this  first 
observation.  However,  this  causes  the  specified  random  effects  distribution  to  shift 
to  an  unknown  distribution.  Two  approaches  for  estimation  are  outlined:  The  first 
assumes  another  normal  distribution  for  the  new  random  effects  distribution  and 
the  likelihood  is  maximized  using  Gauss-Hermite  quadrature.  The  second  approach 
assumes  no  parametric  form  for  the  new  random  effects  distribution  and  follows  the 
nonparametric  maximum  likelihood  approach  (Aitkin,  1999).  For  binary  data,  the 
new random effects distribution is only a two point distribution and its parameters
can  be  estimated  via  maximum  likelihood  jointly  with  the  other  model  parameters. 

Marginalized  transitional  models  were  briefly  mentioned  with  the  approach 
taken by Azzalini (1994). The idea of "marginalizing", i.e., modeling the marginal
mean of an otherwise conditionally specified model, can also be applied to random
effects  models.  The  advantage  of  transitional  or  random  effects  models  is  the 
ability  to  easily  specify  correlation  patterns,  with  the  potential  disadvantage  that 
parameters  in  such  models  have  conditional  interpretations  when  the  scientific  goal 
is  on  the  interpretation  of  marginal  relationships.  In  marginal  models  parameters 
can  directly  be  interpreted  as  contrasts  between  subpopulations  without  the  need 
of  conditioning  on  previous  observations  or  unobserved  random  effects.  However,  as 
we  mentioned  in  Section  1.2.2,  likelihood  based  inference  in  marginal  models  might 
not  be  possible. 

A  marginalized  random  effects  model  (Heagerty,  1999;  Heagerty  and  Zeger, 
2000)  specifies  two  regression  equations  that  are  consistent  with  each  other.  The 
first  equation, 

$$\mu_t^M = E[y_t] = h^{-1}(x_t'\beta),$$

expresses the marginal mean $\mu_t^M$ as a function of covariates and describes systematic
variation. The second equation characterizes the dependency structure among
observations through specification of the conditional mean $\mu_t^C$,

$$\mu_t^C = E[y_t \mid u_t] = h^{-1}(\Delta_t(x_t) + z_t'u_t),$$

where $u_t$ are random effects with design vector $z_t$. Consistency between the
marginal and conditional specification is achieved by defining $\Delta_t(x_t)$ implicitly
through

$$\mu_t^M = E_u[\mu_t^C] = E_u[E[y_t \mid u_t]] = E_u[h^{-1}(\Delta_t(x_t) + z_t'u_t)].$$

For instance, in a marginalized GLMM with random effects distribution $F(u_t)$,
$\Delta_t(x_t)$ is the solution to the integral equation

$$\mu_t^M = h^{-1}(x_t'\beta) = \int h^{-1}(\Delta_t(x_t) + z_t'u_t) \, dF(u_t),$$

so that $\Delta_t(x_t)$ is a function of the marginal regression coefficients $\beta$ and the
(variance) parameters in $F(u_t)$. Maximum likelihood estimation is based on the
integrated likelihood from the GLMM model.
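
The integral equation generally has no closed-form solution, but it is straightforward to solve numerically. The sketch below assumes a logit link, a scalar random intercept $u_t \sim N(0, \sigma^2)$ with $z_t = 1$, Gauss-Hermite quadrature for the integral and plain bisection for the root; these choices and all function names are illustrative assumptions, not the procedure used by Heagerty (1999).

```python
import numpy as np

def expit(x):
    """Inverse logit link."""
    return 1.0 / (1.0 + np.exp(-x))

# Gauss-Hermite nodes and weights for integrals against exp(-x^2).
_NODES, _WEIGHTS = np.polynomial.hermite.hermgauss(40)

def marginal_mean(delta, sigma):
    """E_u[expit(delta + u)] for u ~ N(0, sigma^2), by Gauss-Hermite quadrature."""
    u = np.sqrt(2.0) * sigma * _NODES
    return float(np.sum(_WEIGHTS * expit(delta + u)) / np.sqrt(np.pi))

def solve_delta(eta_marginal, sigma, lo=-30.0, hi=30.0, tol=1e-10):
    """Find Delta such that E_u[expit(Delta + u)] = expit(eta_marginal).

    eta_marginal plays the role of x_t' beta. Bisection works because the
    marginal mean is strictly increasing in Delta.
    """
    target = expit(eta_marginal)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if marginal_mean(mid, sigma) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Calling `solve_delta(0.5, 2.0)` returns a value larger than 0.5: the conditional linear predictor has to be further from zero than the marginal one to compensate for the averaging over the random effect.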

1.5     Motivation  and  Outline  of  the  Dissertation 

In  this  dissertation  we  propose  generalized  linear  mixed  models  (GLMMs) 
with  correlated  random  effects  to  model  count  or  binomial  response  data  collected 
over  time  or  in  space.  For  sequential  or  spatial  Gaussian  measurements,  maximum 
likelihood  estimation  is  well  established  and  software  (e.g.  SAS's  proc  mixed)  is 
available  to  fit  fairly  complicated  correlation  structures.  The  challenge  for  discrete 
data  lies  in  the  fact  that  the  observed  (marginal)  likelihood  is  not  analytically 
tractable,  and  maximization  of  it  is  more  involved.  Furthermore,  with  correlated 
random effects, the likelihood does not break down into lower-dimensional components
which are easier to integrate numerically. Therefore, most approaches in
the literature are based on a quasi-likelihood approach or take a Bayesian perspective.
The advantage of Bayesian models is that powerful Monte Carlo Markov
chain  methods  make  it  easier  to  obtain  a  sample  from  the  posterior  distribution  of 
interest  than  to  obtain  maximum  likelihood  estimates.  However,  priors  must  be 
specified  very  carefully  to  ensure  posterior  propriety. 

In  addition,  repeated  observations  are  prone  to  missing  data  or  unequally 
spaced  observation  times.  We  would  like  to  develop  methods  and  models  that  allow 
for  unequally  spaced  binary,  binomial  or  Poisson  observations,  making  them  more 
general than previously presented in the literature.

To our knowledge, maximum likelihood estimation of GLMMs with such high
dimensional  random  effects  has  not  been  demonstrated  before,  with  the  exception 
of  the  paper  by  Chan  and  Ledolter  (1995)  who  consider  fitting  of  a  time  series 
of  counts.  However,  they  do  not  consider  unequally  spaced  data  and  employ  a 
different  implementation  of  the  MCEM  algorithm.  In  Chapter  5  we  argue  that 
their  implementation  of  the  algorithm  might  have  been  stopped  prematurely, 
leading  to  different  conclusions  than  our  analysis  and  analyses  published  elsewhere. 
Most  articles  that  discuss  correlated  random  effects  do  so  for  only  a  small  number 
of correlated random effects. For example, Chan and Kuk (1997) show that the data set
on salamander mating behavior published and analyzed in McCullagh and Nelder
(1989)  is  more  appropriately  analyzed  when  random  effects  pertaining  to  the  male 
salamander  population  are  correlated  over  the  three  different  time  points  when  they 
were  observed.  In  this  thesis,  we  would  like  to  consider  much  longer  sequences  of 
repeated  observations. 

In  Chapter  2  we  introduce  the  GLMM  as  the  model  of  our  choice  to  analyze 
correlated  discrete  data  and  outline  an  EM  algorithm  to  estimate  fixed  and  random 
effects,  where  both  the  E-step  and  the  M-step  require  numerical  approximations, 
leading  to  an  EM  algorithm  based  on  Monte  Carlo  methods  (MCEM).  Correlated 
random  effects  and  their  implications  on  the  analysis  of  GLMMs  are  discussed  in 
Chapter  3,  together  with  a  motivating  example.  This  chapter  also  gives  details  for 
the  implementation  of  the  algorithm  and  reports  results  from  simulation  studies. 
Chapter  4  looks  at  marginal  model  properties  and  interpretation  for  correlated 
binary,  binomial  or  Poisson  observations  and  Chapter  5  applies  our  methods  to 
real  data  sets  from  the  social  sciences,  public  health,  sports  and  other  backgrounds. 
A  summary  and  discussion  of  the  methods  and  models  presented  here  is  given  in 
Chapter  6. 


CHAPTER  2 
GENERALIZED  LINEAR  MIXED  MODELS 

Chapter  1  reviewed  various  approaches  of  extending  GLMs  to  deal  with 
correlated data. In this chapter we will take a closer look at generalized linear
mixed  models  (GLMMs)  which  were  briefly  mentioned  in  Section  1.4.  When 
the  response  variables  are  normal,  these  models  are  simply  called  linear  mixed 
models  (LMMs)  and  have  been  extensively  discussed  in  the  literature  (see,  for 
example,  the  books  by  Searle,  Casella  and  McCulloch,  1992,  and  Verbeke  and 
Molenberghs,  2000).  The  form  of  the  normal  density  for  observations  and  random 
effects  allows  for  analytical  evaluation  of  the  integrals  together  with  straightforward 
maximization.  Hence,  LMMs  can  be  readily  fit  with  existing  software  (e.g.,  SAS's 
proc  mixed),  using  rich  classes  of  pre-specified  correlation  structures  for  the  random 
effects  to  model  the  dependence  in  the  data  more  precisely.  The  broader  notion 
of  GLMMs  also  encompasses  binary,  binomial,  Poisson  or  gamma  responses. 
A distinctive feature of GLMMs is their so-called subject-specific parameter
interpretation, which differs from the interpretation of parameters in marginal
(Section 1.2) or transitional (Section 1.3) models. This feature is discussed in
Section 2.1, after a formal introduction of the GLMM. Throughout, special
attention is devoted to defining GLMMs for discrete time series observations.

GLMMs  are  harder  to  fit  because  they  typically  involve  intractable  integrals 
in  the  likelihood  function.  Section  2.2  outlines  various  approaches  to  model  fitting. 
Section  2.3  focuses  on  a  Monte  Carlo  version  of  the  EM  algorithm  which  is  an 
indirect  method  of  finding  maximum  likelihood  estimates  in  GLMMs.  Monte  Carlo 
methods  are  necessary  because  our  applications  involve  correlated  random  effects 
which  lead  to  a  very  high-dimensional  integral  in  the  likelihood  function.  Parallel 
to the discussion of GLMMs, state space models are introduced and a fitting
algorithm is described. State space models are popular models for discrete time
series in econometric applications (Durbin and Koopman, 2001). The presentation
of specific examples of GLMMs for discrete time series observations is deferred until
Chapter 5.

2.1      Definition and Notation

The generalized linear mixed model is an extension of the well known generalized
linear model (McCullagh and Nelder, 1989) that permits fixed as well as
random effects in the linear predictor (hence the word mixed). The setup process
for GLMMs is split into two stages, which we present here using notation common
for longitudinal studies:

Firstly, conditional on cluster-specific random effects $u_i$, the data are assumed
to follow a GLM with independent random components $Y_{it}$, the $t$-th response
in cluster $i$, $i = 1, \ldots, n$, $t = 1, \ldots, n_i$. A cluster here is a generic expression
and means any form of observations being grouped together, such as repeated
observations on the same subject (cluster = subject), observations on different
students in the same school (cluster = school) or observations recorded in a
common time interval (cluster = time interval). The conditional distribution of $Y_{it}$
is a member of the exponential family of distributions (e.g., McCullagh and Nelder,
1989) with form

$$f(y_{it} \mid u_i) = \exp\left\{ [y_{it}\theta_{it} - b(\theta_{it})]/\phi_{it} + c(y_{it}, \phi_{it}) \right\}, \tag{2.1}$$

where $\theta_{it}$ are natural parameters and $b(\cdot)$ and $c(\cdot)$ are certain functions determined
by the specific member of the exponential family. The parameters $\phi_{it}$ are typically
of form $\phi_{it} = \phi / w_{it}$ where the $w_{it}$'s are known weights and $\phi$ is a possibly unknown
dispersion parameter. For the discrete response GLMMs we are considering,
$\phi = 1$. For a specific link function $h(\cdot)$, the model for the conditional mean for
observations $y_{it}$ has form

$$\mu_{it} = b'(\theta_{it}) = E[Y_{it} \mid u_i] = h^{-1}(x_{it}'\beta + z_{it}'u_i), \tag{2.2}$$

where $x_{it}$ and $z_{it}$ are covariate or design vectors for fixed and random effects
associated with observation $y_{it}$ and $\beta$ is a vector of unknown regression coefficients.
At this first stage, $z_{it}'u_i$ can be regarded as a known offset for each observation,
and observations are conditionally independent.
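
As a concrete check of the exponential family form (2.1), the following sketch evaluates it for the two discrete cases of main interest, using the standard choices $\theta = \log \mu$, $b(\theta) = e^{\theta}$, $c(y, \phi) = -\log y!$ for the Poisson and $\theta = \text{logit}(p)$, $b(\theta) = \log(1 + e^{\theta})$, $c = 0$ for the Bernoulli distribution, with $\phi = 1$. The function names are illustrative only.

```python
import math

def expfam_density(y, theta, b, c, phi=1.0):
    """Evaluate exp{[y*theta - b(theta)]/phi + c(y, phi)} as in (2.1)."""
    return math.exp((y * theta - b(theta)) / phi + c(y, phi))

def poisson_density(y, mu):
    """Poisson: theta = log(mu), b(theta) = exp(theta), c(y, phi) = -log(y!)."""
    return expfam_density(y, math.log(mu), math.exp,
                          lambda y, phi: -math.lgamma(y + 1))

def bernoulli_density(y, p):
    """Bernoulli: theta = logit(p), b(theta) = log(1 + exp(theta)), c = 0."""
    return expfam_density(y, math.log(p / (1.0 - p)),
                          lambda th: math.log1p(math.exp(th)),
                          lambda y, phi: 0.0)
```

In both cases $b'(\theta)$ recovers the mean ($e^{\theta} = \mu$ and $e^{\theta}/(1 + e^{\theta}) = p$), in agreement with (2.2).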

It  should  be  noted  that  relationship  (2.2)  between  the  mean  of  the  observation 
and  fixed  and  random  effects  is  exactly  as  is  specified  in  the  systematic  part  of 
GLMs,  with  the  exception  that  in  GLMMs  a  conditional  mean  is  modeled.  This 
affects parameter interpretation. The regression coefficients $\beta$ represent the effect
of explanatory variables on the conditional mean of observations, given the random
effects. For instance, observations in the same cluster $i$ share a common value
of the random cluster effect $u_i$, and hence $\beta$ describes the conditional effect of
explanatory variables, given the value for $u_i$. If the cluster consists of repeated
observations  on  the  same  subject,  these  effects  are  called  subject-specific  effects. 
In  contrast,  regression  coefficients  in  GLMs  and  marginal  models  describe  the 
effect  of  explanatory  variables  on  the  population  average,  which  is  an  average  over 
observations  in  different  clusters. 
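
A short worked case may make the distinction concrete. The derivation below is a sketch under the assumption of a single random intercept $u_i \sim N(0, \sigma^2)$ (that is, $z_{it} = 1$) and a log link; it uses only the moment generating function of the normal distribution.

```latex
\begin{align*}
E[Y_{it}] &= E_u\bigl[ E[Y_{it} \mid u_i] \bigr]
           = E_u\bigl[ \exp(x_{it}'\beta + u_i) \bigr] \\
          &= \exp(x_{it}'\beta)\, E_u\bigl[ e^{u_i} \bigr]
           = \exp\bigl( x_{it}'\beta + \sigma^2/2 \bigr).
\end{align*}
```

In this special case only the intercept is shifted by $\sigma^2/2$, so the slope coefficients have the same value in the conditional and the marginal mean. For a logit link no such closed form exists, and averaging over $u_i$ pulls the marginal effects toward zero relative to the subject-specific ones.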

At the second stage, the random effects $u_i$ are specified to follow a multivariate
normal distribution with mean zero and variance-covariance matrix $\Sigma_i$. A
standard assumption is that random effects $\{u_i\}$ are independent and identically
distributed, but an example at the beginning of Chapter 3 will show that this
is sometimes not appropriate. With time series observations, where the clusters
refer to time segments, it is reasonable to assume that observations are not only
correlated within the cluster (modeled by sharing the same cluster-specific random
effect), but also across clusters, which we will model by assuming correlated
cluster-specific random effects.

2.1.1     Generalized Linear Mixed Models for Univariate Discrete Time Series

Most  of  the  data  we  are  going  to  analyze  is  in  the  form  of  a  single  univariate 
time  series.  To  emphasize  this  data  structure,  the  general  two-dimensional  notation 
(indices  i  and  t)  of  a  GLMM  to  model  observations  which  come  in  clusters  can  be 
simplified  in  two  ways: 

We can assume that a single cluster (i.e., $n = 1$ and $n_1 = T$) contains the
entire time series $y_1, \ldots, y_T$. The random effects vector $u = (u_1, \ldots, u_T)'$ associated
with the single cluster has a random effects component for each individual
time series member. The distribution of $u$ is multivariate normal with variance-covariance
matrix $\Sigma$, which is different from the identity matrix. The correlation
of the components of $u$ induces a correlation among the time series members.
However, conditional on $u$, observations within the single cluster are independent.
The cluster index $i$ is redundant in the notation and hence can be dropped. This
representation is particularly useful when used with existing software to fit GLMMs,
where  it  is  often  necessary  to  include  a  column  indicating  the  cluster  membership 
information  for  each  observation.  Here,  since  we  have  only  one  cluster,  it  suffices  to 
include  a  column  of  all  ones,  say. 

Alternatively, we can adopt the point of view that each member of the time
series is a mini cluster by itself, containing only one observation (i.e., $n_i = 1$ for
all $i = 1, \ldots, T$) in the case of a single time series. When multiple, parallel time
series are observed, the cluster contains all $c$ observations at time point $t$ from the
$c$ parallel time series (i.e., $n_i = c$ for all $i = 1, \ldots, T$). In any case, the clusters
are then synonymous with the discrete time points at which observations were
recorded. This makes index $t$, which counts the repeated observations in a cluster,
redundant ($t = 1$ or $t = 1, \ldots, c$ for all clusters $i$), but instead of denoting the time series
by $\{y_i\}_{i=1}^{n}$, we decided to use the more common notation $\{y_t\}_{t=1}^{T}$, where $t$ now is
the index for clusters or, equivalently, time points. In the following definition of
GLMMs for univariate time series, the notation of clusters or time points can be
used interchangeably. Conditional on unobserved random effects $u_1, \ldots, u_T$ for
the different time points, observations $y_1, \ldots, y_T$ are assumed independent with
distributions

$$f(y_t \mid u_t) = \exp\left\{ [y_t\theta_t - b(\theta_t)]/\phi_t + c(y_t, \phi_t) \right\} \tag{2.3}$$

in the exponential family. As before, for a specific link function $h(\cdot)$, the model for
the conditional mean has form

$$E[y_t \mid u_t] = \mu_t = b'(\theta_t) = h^{-1}(x_t'\beta + z_t'u_t), \tag{2.4}$$

where $x_t$ and $z_t$ are covariate or design vectors for fixed and random effects
associated with the $t$-th observation and $\beta$ is a vector of unknown regression
coefficients. The random effects $u_1, \ldots, u_T$ are typically not independent. When
collected in the vector $u = (u_1, \ldots, u_T)'$, a multivariate normal distribution with
mean 0 and covariance matrix $\Sigma$ can be directly specified. In particular, in Chapter
3 we will assume special patterned covariance matrices to allow for rich, but still
parsimonious, classes of correlation structures among the time series observations.

The  advantage  of  the  second  setup  of  mini  clusters  is  that  it  also  allows  for 
other,  indirect  specifications  of  the  random  effects  distribution,  for  instance  through 
a latent random process. For this, we relate cluster-specific random effects from
successive time points. For example, with univariate random effects, a first-order
latent autoregressive process assumes that the random effects follow

$$u_{t+1} = \rho u_t + \epsilon_t, \tag{2.5}$$

where $\epsilon_t$ has a zero-mean normal distribution and $\rho$ is a correlation parameter.
Cox (1981) called these types of models parameter-driven models, as opposed to
transitional  (or  observation-driven)  models  discussed  in  Section  1.3.  In  parameter- 
driven  models,  an  underlying  and  unobserved  parameter  process  influences  the 
distribution  of  a  series  of  observations.  The  model  for  the  polio  data  in  Zeger 
(1988)  is  an  example  of  a  parameter-driven  model.  However,  Zeger  (1988)  does  not 
assume  normality  nor  zero  mean  for  the  latent  autoregressive  process.  Furthermore, 
the  natural  logarithm  of  the  random  effects,  and  not  the  random  effects  themselves 
appear  additively  in  the  linear  predictor.  Therefore,  this  model  is  slightly  different 
from  the  specifications  of  the  time  series  GLMM  from  above. 
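
To make the parameter-driven construction concrete, the following sketch simulates a Poisson time series whose log-mean contains a latent AR(1) random effect as in (2.5), and builds the covariance matrix $\Sigma$ that the stationary process implies for $u = (u_1, \ldots, u_T)'$. It assumes a log link, a scalar random effect with $z_t = 1$, and a parametrization in terms of the marginal standard deviation `sigma` of $u_t$ (so the innovations $\epsilon_t$ have variance $\sigma^2(1 - \rho^2)$); these are illustrative choices, not the specification used later in the dissertation.

```python
import numpy as np

def ar1_covariance(T, rho, sigma):
    """Covariance matrix of a stationary AR(1) process: sigma^2 * rho^|t-s|."""
    lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    return sigma**2 * rho**lags

def simulate_poisson_ar1(x, beta, rho, sigma, rng):
    """Simulate y_t | u_t ~ Poisson(exp(x_t' beta + u_t)) with AR(1) u_t."""
    T = x.shape[0]
    u = np.empty(T)
    u[0] = rng.normal(0.0, sigma)                  # draw from the stationary law
    innov_sd = sigma * np.sqrt(1.0 - rho**2)       # sd of the innovations
    for t in range(T - 1):
        u[t + 1] = rho * u[t] + rng.normal(0.0, innov_sd)   # equation (2.5)
    y = rng.poisson(np.exp(x @ beta + u))          # conditionally independent counts
    return y, u
```

For instance, with `x = np.column_stack([np.ones(100), np.arange(100) / 100])` and `beta = np.array([0.2, 0.5])`, a call `simulate_poisson_ar1(x, beta, 0.8, 1.0, np.random.default_rng(0))` yields counts whose dependence decays with increasing lag, in contrast to the exchangeable correlation induced by a single shared random effect.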

Another application of the mini cluster setup is to spatial settings, where clusters
represent spatially aggregated data instead of time points. Then $u_1, \ldots, u_T$ is
a  collection  of  random  effects  associated  with  spatial  clusters.  Again,  independent 
random  effects  to  describe  the  spatial  dependencies  are  inappropriate.  In  general, 
time  dependent  data  are  easier  to  handle  since  observations  are  linearly  ordered 
in  time,  and  more  complicated  random  effects  distributions  are  needed  for  spatial 
applications  (e.g.,  Besag  et  al.  1995). 

We  will  use  the  mini-cluster  representation  to  facilitate  a  comparison  to  state 
space models. This is the focus of the next section.

2.1.2     State Space Models for Discrete Time Series Observations

State  space  models  are  a  rich  alternative  to  the  traditional  Box-Jenkins 
ARIMA  system  for  time  series  analysis.  Similar  to  GLMMs,  state  space  models 
for  Gaussian  and  non-Gaussian  time  series  split  the  modeling  process  into  two 
stages: At the first stage, the responses $y_t$ are related to unobserved "states" by an
observation equation. (State space models originated in the engineering literature,
where parameters are often called states.) At the second stage, a latent or hidden
Markov model is assumed for the states. For univariate Gaussian responses
$y_1, \ldots, y_T$, the two equations of a state space model take the form

$$\begin{aligned}
y_t &= w_t'\alpha_t + \epsilon_t, & \epsilon_t &\sim N(0, \sigma^2), \\
\alpha_t &= T_t\alpha_{t-1} + R_t\xi_t, & \xi_t &\sim N(0, Q_t), \quad t = 1, \ldots, T,
\end{aligned} \tag{2.6}$$

where $w_t$ is an $m \times 1$ observation or design vector and $\epsilon_t$ is a white noise process.
The unobserved $m \times 1$ state or parameter vector $\alpha_t$ is defined by the second
(transition) equation, where $T_t$ is a transition matrix and $\xi_t$ is another white noise
process, independent of the first one. Compared to the standard GLMMs of
Section 2.1, the main difference is that random effects are correlated instead of
i.i.d. In state space models, no clear distinction between fixed and random effects
is made, and the state vector $\alpha_t$ can contain both. However, the form of the
transition matrix $T_t$ together with the form of the matrix $R_t$, which consists of
columns of the identity matrix $I_m$, allows one to declare certain elements of $\alpha_t$ as
being fixed effects and others to be random. The matrix $R_t$ is called the selection
matrix since it selects the rows of the state equation which have nonzero variance
terms. With this formulation, the variance-covariance matrix $Q_t$ is assumed to
be non-singular. Furthermore, the transition matrix $T_t$ allows specification of
which effects vary through time and which stay constant. (For a slightly different
formulation without a selection matrix $R_t$ but with possibly singular variance-covariance
matrix $Q_t$, see Fahrmeir and Tutz, 2001, Chap. 8).

State  space  models  for  non-Gaussian  time  series  were  considered  by  West 
et  al.  (1985)  under  the  name  dynamic  generalized  linear  model.  They  used  a 
Bayesian  framework  with  conjugate  priors  to  specify  and  fit  their  models.  Durbin 
and  Koopman  (1997,  2000)  and  Fahrmeir  and  Tutz  (2001)  describe  a  state  space 
structure  for  non-Gaussian  observations  similar  to  the  two  equations  above. 
The  normal  distribution  assumption  for  the  observations  in  (2.6)  is  replaced  by 
assuming a distribution in the exponential family with natural parameters $\theta_t$. With
a canonical link, $\theta_t = w_t'\alpha_t$. This is called the signal by Durbin and Koopman
(1997, 2000). In particular, given the states $\alpha_1, \ldots, \alpha_T$, observations $y_1, \ldots, y_T$
are conditionally independent and have density $p(y_t \mid \theta_t)$ in the exponential family
(2.3). As in Gaussian state space models, the state vector $\alpha_t$ is determined by the
vector autoregressive relationship

$$\alpha_t = T_t\alpha_{t-1} + R_t\xi_t, \tag{2.7}$$

where the serially independent $\xi_t$ typically have normal distributions with mean 0
and variance-covariance matrix $Q_t$.

2.1.3     Structural  Similarities  Between  State  Space  Models  and  GLMMs 

There  is  a  strong  connection  between  state  space  models  and  GLMMs  with 
a canonical link. To see this, we write the GLMM in state space form: Let $w_t =
(x_t', z_t')'$ and $\alpha_t = (\beta', u_t')'$, where $x_t$, $z_t$, $\beta$ and $u_t$ are from the GLMM notation
as defined in (2.4). Hence, the linear predictor $x_t'\beta + z_t'u_t$ of the GLMM is equal
to the state space signal $\theta_t = w_t'\alpha_t$. Next, partition the disturbance term $\xi_t$ of
the state equation into $\xi_t = (\xi_t^{\beta\prime}, \xi_t^{u\prime})'$ and consider special transition and selection
matrices of block form

$$T_t = \begin{pmatrix} I & 0 \\ 0 & \tilde{T}_t \end{pmatrix}, \qquad R_t = \begin{pmatrix} 0 & 0 \\ 0 & \tilde{R}_t \end{pmatrix}.$$

Using transition equation (2.7) results in the following autoregressive relationship
between the random effects of a GLMM: $u_t = \tilde{T}_t u_{t-1} + \epsilon_t$, where $\epsilon_t = \tilde{R}_t \xi_t^{u}$
is a white noise component. In a univariate context, we have already motivated
this type of relationship between random effects in equation (2.5). The transition
equation also implies a constant effect $\beta$ for the GLMM, since $\beta_1 = \beta_2 = \cdots =
\beta_T := \beta$. Hence, both models use correlated random effects, but GLMMs also
typically involve fixed parameters which are not modeled as evolving over time.
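
The block structure can be illustrated with a small simulation. The sketch below assumes a scalar random effect with time-constant AR(1) coefficient `rho` in the lower-right block of the transition matrix, and it uses the selection-matrix form in which $R_t$ is the single column of the identity matrix that picks out the random-effect row; the function name and all numerical values are hypothetical.

```python
import numpy as np

def simulate_states(beta, rho, n_steps, rng, innov_sd=1.0):
    """Iterate alpha_t = T alpha_{t-1} + R xi_t with alpha_t = (beta', u_t)'."""
    beta = np.asarray(beta, dtype=float)
    p = len(beta)
    # Transition matrix: identity block keeps beta fixed, rho drives u_t.
    T_mat = np.block([[np.eye(p), np.zeros((p, 1))],
                      [np.zeros((1, p)), np.array([[rho]])]])
    # Selection matrix: only the last state element receives a disturbance.
    R_mat = np.vstack([np.zeros((p, 1)), np.ones((1, 1))])
    alpha = np.concatenate([beta, [0.0]])        # alpha_0 = (beta', u_0 = 0)'
    states = np.empty((n_steps, p + 1))
    xis = np.empty(n_steps)
    for t in range(n_steps):
        xis[t] = rng.normal(0.0, innov_sd)
        alpha = T_mat @ alpha + R_mat[:, 0] * xis[t]   # transition equation (2.7)
        states[t] = alpha
    return states, xis
```

With `states, xis = simulate_states([0.5, -0.2], 0.8, 50, np.random.default_rng(1))`, the first two columns of `states` stay equal to the fixed effects at every time point, while the last column follows the autoregression. The signal at time $t$ is then `states[t] @ np.append(x_t, 1.0)`, which equals $x_t'\beta + u_t$.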

The restriction of the transition equation to the autoregressive form (often
a simple random walk) is only one way of specifying a distribution for the
random effects in the GLMMs of Section 2.2.1. Other structures, such as equally
correlated random effects, are possible within the GLMM framework and are
considered in Chapter 3.

2.1.4     Practical Differences

Although  GLMMs  and  state  space  models  are  similar  in  structure,  they  are 
used  differently  in  practice.  This  is  in  part  due  to  the  fact  that  in  GLMMs  the 
focus is on the fixed subject-specific regression parameters $\beta$, which refer to time-constant and time-varying covariates, while in state space models the main purpose is to infer properties about the time-varying random states $\alpha_t$. These are often
assumed  to  follow  a  first  or  second  order  random  walk.  To  illustrate,  consider  a 
data  set  about  a  monthly  time  series  of  counts  presented  in  Durbin  and  Koopman 
(2000)  for  the  investigation  of  the  effectiveness  of  new  seat  belt  legislation  on 
automobile  accidents.  They  specify  the  log-mean  for  a  Poisson  state  space  model  as 

$$\log(\mu_t) = \nu_t + \lambda x_t + \gamma_t,$$

where $\nu_t$ is a trend component following the random walk

$$\nu_{t+1} = \nu_t + \epsilon_t,$$

with $\epsilon_t$ a white noise process. Further, $\lambda$ is an intervention parameter corresponding to the change in seat-belt legislation ($x_t = 0$ before the change and equal to 1 afterwards) and $\{\gamma_t\}$ are fixed seasonal components with $\sum_{t=1}^{12} \gamma_t = 0$ and equal
in every year. The main focus is on the parameter $\lambda$ describing the drop in the log means, also called level, after the seat belt legislation went into effect.

For the series at hand, a GLMM approach would consider a fixed linear time effect $\beta$, and model the log-mean as

$$\log(\mu_t) = \alpha + \beta t + \lambda x_t + \gamma_t + u_t,$$

where $\alpha$ is the intercept of the linear time trend with slope $\beta$ and where correlated random effects $\{u_t\}$ account for the correlation in the monthly log means. Similar to above, $\lambda$ describes the effect on the log means after the seat belt legislation went into effect and $\gamma_t$ are fixed seasonal components, equal for every year. The trend component $\nu_t$ of the state space model corresponds to $\alpha + u_t$ in the GLMM, which additionally allows for a linear time trend $\beta t$. This approach seems to be favored by some discussants of the Durbin and Koopman (2000) paper. (In particular, see the discussions by Chatfield or Aitkin, who mention the lack of a linear trend term in the proposed state space model.)

An even better GLMM approach with linear time trends explicitly modeled could use a change point formulation in the linear predictor, with the month the legislation went into effect (or was enforced) as the change point, and again with correlated random effects $\{u_t\}$ to capture the dependency among successive means. Such a specification would be harder to formulate as a state space model.

In  the  reply  to  the  discussion  of  their  paper,  Durbin  and  Koopman  (2000) 
wrote  that  the  two  approaches  (state  space  models  versus  Hierarchical  Generalized 
Linear  Models,  a  model  class  very  similar  to  GLMMs)  are  very  different  and  that 
they  regard  their  treatment  to  be  more  transparent  and  general  for  problems  that 
specifically  relate  to  time  series.  With  the  presentation  of  GLMMs  with  correlated 
random  effects  for  time  series  analysis  in  this  thesis  their  argument  might  weaken. 
For instance, with the proposal of autocorrelated random effects $u_{t+1} = \rho u_t + \epsilon_t$ in a GLMM context we have an elegant means of introducing autocorrelation into a basic regression model that is well understood and whose parameters are easily interpreted. Furthermore, GLMMs can easily accommodate the case of multiple time series observations on each of several individuals or cross-sectional units, as is often observed in a longitudinal study.

One  common  feature  of  both  models  is  the  intractability  of  the  likelihood 
function  and  the  use  of  numerical  and  simulation  techniques  to  obtain  maximum 
likelihood  estimates.  In  general,  state  space  models  for  non-Gaussian  time  series 
are  fit  using  a  simulated  maximum  likelihood  approach,  which  is  also  a  popular 
method  for  fitting  GLMMs.  However,  long  time  series  necessarily  result  in  models 
with complex and high dimensional random effects, and alternative, indirect methods may work better. Jank and Booth (2003) indicate that simulated maximum
likelihood,  the  method  of  choice  for  estimation  in  state  space  models,  may  not 
work  as  well  as  indirect  methods  based  on  the  EM  algorithm,  the  method  we  will 
use  to  fit  GLMMs  for  time  series  observations.  The  next  section  reviews  various 
approaches  of  fitting  GLMMs  and  contrasts  them  with  the  approach  taken  for  state 
space  models. 

2.2     Maximum  Likelihood  Estimation 

Maximum  likelihood  estimation  in  GLMMs  is  a  challenging  task,  because  it 
requires  the  calculation  of  integrals  (often  high  dimensional)  that  have  no  known 
analytic solution. Following the general notation of a GLMM in Section 2.1, let $y_i = (y_{i1}, \ldots, y_{in_i})'$ be the vector of all observations in cluster $i$, whose associated random effects vector is $u_i$. Conditional independence of the $y_{it}$'s implies that the density function of $y_i$ is given by

$$f(y_i \mid u_i; \beta) = \prod_{t=1}^{n_i} f(y_{it} \mid u_i; \beta), \tag{2.8}$$

where $f(y_{it} \mid u_i; \beta)$ are the exponential family densities in (2.1). The parameter $\beta$ is the vector of all unknown regression coefficients introduced by specifying model (2.2) for the mean of the observations. Furthermore, observations from different clusters are assumed conditionally independent, leading to the conditional joint density

$$f(y_1, \ldots, y_n \mid u_1, \ldots, u_n; \beta) = \prod_{i=1}^{n} f(y_i \mid u_i; \beta)$$

for all observations given all random effects. Let $g(u_1, \ldots, u_n; \psi)$ denote the multivariate normal density function of the random effects, whose variance-covariance matrix $\Sigma$ is determined by the variance component vector $\psi$. The goal is to estimate the unknown parameter vectors $\beta$ and $\psi$ by maximum likelihood. The likelihood function $L(\beta, \psi; y_1, \ldots, y_n)$ for a GLMM is given by the marginal density function of the observations $y_1, \ldots, y_n$, viewed as a function of the parameters, and is equal to

$$\begin{aligned}
L(\beta, \psi; y_1, \ldots, y_n) &= \int f(y_1, \ldots, y_n \mid u_1, \ldots, u_n; \beta)\, g(u_1, \ldots, u_n; \psi)\, du_1 \ldots du_n \\
&= \int \prod_{i=1}^{n} f(y_i \mid u_i; \beta)\, g(u_1, \ldots, u_n; \psi)\, du_1 \ldots du_n \\
&= \int \prod_{i=1}^{n} \prod_{t=1}^{n_i} f(y_{it} \mid u_i; \beta)\, g(u_1, \ldots, u_n; \psi)\, du_1 \ldots du_n.
\end{aligned} \tag{2.9}$$

It is called the "observed" likelihood because the unobserved random effects have been integrated out and (2.9) is a function of the observed data only. Except for the linear mixed model, where $f(y_1, \ldots, y_n \mid u_1, \ldots, u_n)$ is a normal density, the integral has no closed-form solution and numerical procedures (analytic or stochastic) are necessary to calculate and maximize it. Standard maximization techniques such as Newton-Raphson or EM for fitting GLMs and LMMs have to be modified because the conditional distribution of the observations and the distribution of the random effects are not conjugate and the integral is analytically intractable.

2.2.1      Direct  and  Indirect  Maximum  Likelihood  Procedures 

In general, there are two ways to obtain maximum likelihood estimates from the marginal likelihood in (2.9). The first is a direct approach that attempts to approximate the integral by either analytic or stochastic methods and then maximizes this approximation with respect to the parameters $\beta$ and $\psi$.
Some  common  analytic  approximation  methods  are  Gauss-Hermite  quadrature 
(Abramowitz  and  Stegun,  1964),  a  first-order  Taylor  series  expansion  of  the 
integrand  or  a  Laplace  approximation  (Tierney  and  Kadane,  1986),  which  is 
based  on  a  second-order  Taylor  series  expansion.  The  two  latter  methods  result 
in  likelihood  equations  similar  to  a  linear  mixed  model  (Breslow  and  Clayton, 
1993;  Wolfinger  and  O'Connell,  1993),  and  by  iteratively  fitting  such  a  model  and 
re-expanding  the  integrand  around  updated  parameter  estimates  one  can  obtain 
approximate  maximum  likelihood  estimates.  However,  these  methods  have  been 
shown  to  yield  estimates  which  can  be  biased  and  inconsistent,  an  issue  which  is 
discussed  in  Lin  and  Breslow  (1996)  and  Breslow  and  Lin  (1995). 

Techniques  using  stochastic  integral  approximations  are  known  under  the  name 
simulated  maximum  likelihood  and  have  been  proposed  by  Geyer  and  Thompson 
(1992)  and  Gelfand  and  Carlin  (1993).  These  methods  approximate  the  integral  in 
(2.9)  by  importance  sampling  (Robert  and  Casella,  1999)  and  are  better  suited  for 
larger  dimensional  integrals  than  analytic  approximations.  Usually,  the  importance 
density  depends  on  the  parameters  to  be  estimated,  and  so  simulated  maximum 
likelihood  is  used  iteratively  by  first  approximating  the  integral  by  a  Monte  Carlo 
sum  with  some  initial  values  for  the  unknown  parameters.  Then,  the  likelihood 
is  maximized  and  the  resulting  parameters  are  used  to  generate  a  new  sample 
from the importance density in the next iteration. We will briefly discuss the idea
behind  importance  sampling  in  Section  2.3.2.  The  simulated  maximum  likelihood 
approach  is  also  further  illustrated  in  the  next  section,  where  we  discuss  it  in  the 
context  of  state  space  models. 
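
The stochastic approximation of the integral in (2.9) can be sketched for a toy case. The Python code below is only an illustration under stated assumptions: a single cluster of Poisson counts with a log link, a fixed intercept beta and one normally distributed random intercept u with standard deviation sigma. The importance density is a normal density whose mean and standard deviation are picked by hand (in practice they would depend on the current parameter values), and a trapezoid rule on a grid serves as a deterministic reference. All function names are hypothetical.

```python
import math
import random


def poisson_logpmf(y, mu):
    """Log of the Poisson probability mass function (mu > 0)."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)


def normal_logpdf(x, mean, sd):
    """Log of the normal density."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2


def log_joint(u, y, beta, sigma):
    """log f(y | u; beta) + log g(u; sigma) for one cluster with a random intercept."""
    loglik = sum(poisson_logpmf(yt, math.exp(beta + u)) for yt in y)
    return loglik + normal_logpdf(u, 0.0, sigma)


def likelihood_importance(y, beta, sigma, imp_mean, imp_sd, m=20000, seed=1):
    """Monte Carlo estimate of the marginal likelihood by importance sampling."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        u = rng.gauss(imp_mean, imp_sd)
        # importance weight: joint density divided by importance density
        total += math.exp(log_joint(u, y, beta, sigma) - normal_logpdf(u, imp_mean, imp_sd))
    return total / m


def likelihood_grid(y, beta, sigma, lo=-8.0, hi=8.0, n=4000):
    """Trapezoid-rule reference value for the same one-dimensional integral."""
    h = (hi - lo) / n
    vals = [math.exp(log_joint(lo + i * h, y, beta, sigma)) for i in range(n + 1)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
```

For a one-dimensional random effect the grid approximation is of course sufficient by itself; the point of the sketch is only that the Monte Carlo average of importance weights targets the same integral, which is what makes the approach usable when the dimension of the random effects rules out a grid.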

An  alternative  to  the  direct  approximating  methods  is  the  EM-algorithm 
(Dempster et al., 1977). The integral in (2.9) is not directly maximized in this method, but is maximized indirectly by considering a related function $Q(\cdot \mid \cdot)$. At each step of this algorithm, maximization of the Q-function increases the marginal likelihood, a fact that can be verified using Jensen's inequality. The EM-algorithm relies on recognizing or inventing missing data, which, together with the observed data, simplifies maximum likelihood calculations. For GLMMs, the random effects $u_1, \ldots, u_n$ are treated as the missing data. In particular, let $\beta^{(k-1)}$ and $\psi^{(k-1)}$ denote current (at the end of iteration $k-1$) values for parameter vectors $\beta$ and $\psi$. Also, let $y' = (y_1', \ldots, y_n')$ and $u' = (u_1', \ldots, u_n')$ denote the vector of all observations and their associated random effects. Then, the $Q(\cdot \mid \cdot)$ function at the start of iteration $k$ has form

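$$\begin{aligned}
Q(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)}) &= E\left[\log j(y, u; \beta, \psi) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right] \\
&= E\left[\log f(y \mid u; \beta) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right] + E\left[\log g(u; \psi) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right],
\end{aligned} \tag{2.10}$$

where

$$j(y, u; \beta, \psi) = f(y \mid u; \beta)\, g(u; \psi)$$
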
denotes  the  joint  density  of  observed  and  missing  data,  also  known  as  the  complete 
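data. The expectation in (2.10) is with respect to the conditional distribution of $u \mid y$, evaluated at the current parameter estimates $\beta^{(k-1)}$ and $\psi^{(k-1)}$. The calculation of the expected value is called the E-step and it is followed by an M-step which maximizes $Q(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)})$ with respect to $\beta$ and $\psi$. The resulting estimates $\beta^{(k)}$ and $\psi^{(k)}$ are used as updates in the next iteration to recalculate the E-step and the M-step. Since $(\beta^{(k)}, \psi^{(k)})$ is the maximizer at iteration $k$,

$$Q(\beta^{(k)}, \psi^{(k)} \mid \beta^{(k-1)}, \psi^{(k-1)}) \ge Q(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)}) \quad \text{for all } (\beta, \psi), \tag{2.11}$$

and it follows that the likelihood increases (or at worst stays the same) from one iteration to the next:

$$\begin{aligned}
\log L(\beta^{(k)}, \psi^{(k)}; y) &= Q(\beta^{(k)}, \psi^{(k)} \mid \beta^{(k-1)}, \psi^{(k-1)}) - E\left[\log h(u \mid y; \beta^{(k)}, \psi^{(k)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right] \\
&\ge Q(\beta^{(k-1)}, \psi^{(k-1)} \mid \beta^{(k-1)}, \psi^{(k-1)}) - E\left[\log h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right] \\
&= \log L(\beta^{(k-1)}, \psi^{(k-1)}; y).
\end{aligned}$$

Here we used (2.11) and the fact that

$$\begin{aligned}
& E\left[\log h(u \mid y; \beta^{(k)}, \psi^{(k)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right] - E\left[\log h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)}) \mid y, \beta^{(k-1)}, \psi^{(k-1)}\right] \\
&\quad = E\left[\log \frac{h(u \mid y; \beta^{(k)}, \psi^{(k)})}{h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)})} \,\Big|\, y, \beta^{(k-1)}, \psi^{(k-1)}\right] \\
&\quad \le \log E\left[\frac{h(u \mid y; \beta^{(k)}, \psi^{(k)})}{h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)})} \,\Big|\, y, \beta^{(k-1)}, \psi^{(k-1)}\right] \\
&\quad = 0,
\end{aligned}$$

where the inequality in the last step derives from Jensen's inequality. Under regularity conditions (Wu, 1983) and some initial starting values $(\beta^{(0)}, \psi^{(0)})$, the sequence of estimates $\{(\beta^{(k)}, \psi^{(k)})\}$ converges to the maximum likelihood estimators $(\hat{\beta}, \hat{\psi})$.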
The  EM-algorithm  is  most  useful  if  replacing  the  calculation  of  the  integral  in 
the  marginal  likelihood  (2.9)  by  the  calculation  of  the  integral  in  the  Q-function 
(2.10) simplifies computation and maximization. Unfortunately, for GLMMs the integrals in (2.10) are also intractable since the conditional density of $u \mid y$ involves the integral in (2.9). However, the EM algorithm may still be used by approximating the expectation in the E-step with appropriate Monte Carlo methods. The resulting algorithm is called the Monte Carlo EM-algorithm (MCEM) and was proposed by Wei and Tanner (1990). We review it in detail in Section 2.3.

Some arguments favoring the use of the MCEM-algorithm over direct methods such as simulated maximum likelihood for fitting GLMMs, especially when some variance components in $\psi$ are large, are given in Jank and Booth (2003) and Booth
et  al.  (2001).  Currently,  the  only  available  software  for  fitting  GLMMs  uses  direct 
methods  such  as  Gauss-Hermite  quadrature  or  simulated  maximum  likelihood 
(e.g.,  SAS's  proc  nlmixed).  State  space  models  of  Section  2.1.2  are  also  fitted  via 
simulated  maximum  likelihood.  This  is  discussed  in  Section  2.2.3. 
2.2.2     Model  Fitting  in  a  Bayesian  Framework 

In a Bayesian context, GLMMs are two-stage hierarchical models with appropriate priors on $\beta$ and $\psi$. Instead of obtaining maximum likelihood estimates of unknown parameters, a Bayesian analysis looks at their entire posterior distributions, given the observed data. Markov chain Monte Carlo techniques avoid
the  tedious  integrations  in  the  posterior  densities  and  allow  for  relatively  easy 
simulations  from  these  distributions,  compared  with  the  problems  encountered  in 
maximum  likelihood  estimation.  This  suggests  approximating  maximum  likelihood 
estimates  via  a  Bayesian  route,  assuming  improper  or  at  least  very  diffuse  priors 
and  exploiting  the  proportionality  of  the  likelihood  function  and  the  posterior 
distribution  of  the  parameters.  However,  for  many  discrete  data  models,  improper 
priors may lead to improper posteriors. Natarajan and McCulloch (1995) demonstrate this with GLMMs for correlated binary data assuming independent $N(0, \sigma^2)$ random effects and a flat or a non-informative prior for $\sigma^2$. Sun, Tsutakawa and Speckman (1999) and Sun, Speckman and Tsutakawa (2000) show that with noninformative (flat) priors on fixed effects and variance components of more complicated random effects distributions, propriety of the posterior distribution cannot be guaranteed for a Poisson GLMM when one of the observed counts is zero, and is impossible in a logit link GLMM for binomial$(n, \pi)$ observations if just one of the observations is equal to 0 or $n$. Of course, the use of proper priors will always lead to proper posteriors. However, for often employed diffuse but proper priors, Natarajan and McCulloch (1998) show that even with enormous simulation sizes, posterior estimates (such as the posterior mode) can be far away from maximum likelihood estimates, which makes their use undesirable in a frequentist setting.
2.2.3     Maximum  Likelihood  Estimation  for  State  Space  Models 

The same problems as in the GLMM case arise for maximum likelihood fitting of non-Gaussian state space models. Here, we review a simulated maximum likelihood approach suggested by Durbin and Koopman (1997), using notation introduced in Section 2.1.2. Let $p(y \mid \alpha; \psi) = \prod_t p(y_t \mid \alpha_t; \psi)$ denote the distribution of all observations given the states and let $p(\alpha; \psi)$ denote the distribution of the states, where $y$ and $\alpha$ are the stacked vectors of all observations and all states, respectively. The vector $\psi$ holds parameters that may appear in $w_t$, $T_t$ and $Q_t$. Let $p(y, \alpha; \psi)$ denote the joint density of observations and states. For practical purposes, it is easier to work with the signal $\theta_t$ instead of the high dimensional state vector $\alpha_t$. Hence, let $p(y \mid \theta; \psi)$, $p(\theta; \psi)$ and $p(y, \theta; \psi)$ denote the corresponding conditional, marginal and joint distributions parameterized in terms of the signal $\theta_t = w_t'\alpha_t$, $t = 1, \ldots, T$, where $\theta = (\theta_1, \ldots, \theta_T)'$. The observed likelihood is then given by the integral

$$L(\psi; y) = \int p(y \mid \theta; \psi)\, p(\theta; \psi)\, d\theta. \tag{2.12}$$

To maximize (2.12) with respect to $\psi$, Durbin and Koopman (1997, 2000) first calculate the likelihood $L_g(\psi; y)$ for an approximating Gaussian model and then obtain the true likelihood $L(\psi; y)$ by an adjustment to it. However, two different approaches of how to construct the approximating Gaussian model are presented in the two papers. In Durbin and Koopman (1997) the approximating model is obtained by assuming that observations follow a linear Gaussian model

$$y_t = w_t'\alpha_t + \epsilon_t = \theta_t + \epsilon_t,$$

with $\epsilon_t \sim N(\mu_t, \sigma_t^2)$. All densities generated under this model are denoted by $g(\cdot)$. The two parameters $\mu_t$ and $\sigma_t^2$ are chosen such that the true density $p(y \mid \theta; \psi)$ and its normal approximation $g(y \mid \theta; \psi)$ are as close as possible in the neighborhood of the posterior mean $E_g[\theta \mid y]$. The state equations of the true non-Gaussian model and the Gaussian approximating model are assumed to be the same, which implies that the marginal density of $\theta$ is the same under both models, i.e., $p(\theta; \psi) = g(\theta; \psi)$. The likelihood of the approximating model is then given by

$$L_g(\psi; y) = g(y; \psi) = \frac{g(y, \theta; \psi)}{g(\theta \mid y; \psi)} = \frac{g(y \mid \theta; \psi)\, p(\theta; \psi)}{g(\theta \mid y; \psi)}. \tag{2.13}$$

This  likelihood  is  calculated  using  a  recursive  procedure  known  as  the  Kalman 
filter  (see,  for  instance,  Fahrmeir  and  Tutz,  2001,  Chap.  8).  Alternatively,  the 
approximating  Gaussian  model  is  a  regular  linear  mixed  model  and  maximum 
likelihood  calculations  can  be  carried  out  using  more  familiar  algorithms  in  the 
linear  mixed  model  literature  (see,  for  instance,  Verbeke  and  Molenberghs,  2000). 
From (2.13),

$$p(\theta; \psi) = L_g(\psi; y)\, \frac{g(\theta \mid y; \psi)}{g(y \mid \theta; \psi)},$$

and upon plugging this into (2.12),

$$L(\psi; y) = L_g(\psi; y)\, E_g\left[\frac{p(y \mid \theta; \psi)}{g(y \mid \theta; \psi)}\right], \tag{2.14}$$

where $E_g$ denotes expectation with respect to the Gaussian density $g(\theta \mid y; \psi)$ generated by the approximating model. Hence, the observed likelihood of the non-Gaussian model can be estimated by the likelihood of an approximating Gaussian model and an adjustment factor, in particular

$$\hat{L}(\psi; y) = L_g(\psi; y)\, \bar{w}(\psi),$$

where

$$\bar{w}(\psi) = \frac{1}{m} \sum_{i=1}^{m} \frac{p(y \mid \theta^{(i)}; \psi)}{g(y \mid \theta^{(i)}; \psi)}$$

is a Monte Carlo sum approximating the expected value $E_g$ with $m$ random samples $\theta^{(i)}$ from $g(\theta \mid y; \psi)$. Normality of $g(\theta \mid y; \psi)$ allows for straightforward simulation from this density.

A different approach for choosing the approximating Gaussian model is presented in Durbin and Koopman (2000). There, the model is determined by choosing $\sigma_t^2$ and $Q_t$ of an approximating Gaussian state space model (2.6) such that the posterior densities $g(\theta \mid y; \psi)$ implied by the Gaussian model and $p(\theta \mid y; \psi)$ implied by the true model have the same posterior mode $\hat{\theta}$.

Formally, by dividing and multiplying (2.12) by the importance density $g(\theta \mid y; \psi)$, we can interpret approximation (2.14) as an importance sampling estimate of the observed likelihood and the entire procedure as a simulated maximum likelihood approach:

$$L(\psi; y) = \int p(y \mid \theta; \psi)\, \frac{p(\theta; \psi)}{g(\theta \mid y; \psi)}\, g(\theta \mid y; \psi)\, d\theta = L_g(\psi; y)\, E_g\left[\frac{p(y \mid \theta; \psi)}{g(y \mid \theta; \psi)}\right].$$

Durbin and Koopman (1997, 2000) present a clever way of artificially enlarging the simulated sample of $\theta^{(i)}$'s from the importance density $g(\theta \mid y; \psi)$ by the use of antithetic variables (Robert and Casella, 1999). These quadruple the sample size without additional simulation efforts and balance the sample for location and scale. Overall, this leads to a reduction in the total sample size necessary to achieve a certain precision in the estimates.

In practice, it is desirable in the maximization process to work with $\log L(\psi; y)$. Durbin and Koopman (1997, 2000) present a correction for the bias introduced by estimating $\log E_g[p(y \mid \theta; \psi)/g(y \mid \theta; \psi)]$. Finally, the resulting estimator of $\log L(\psi; y)$ can be maximized with respect to $\psi$ by a suitable numerical procedure, such as Newton-Raphson.

We mentioned before that simulated maximum likelihood can be computationally inefficient and suboptimal, especially when some variance components
are  large  (Jank  and  Booth,  2003).  As  we  will  see  in  various  examples  in  Chapter 
5,  large  variance  components  (e.g.,  a  large  random  effects  variance)  are  the  norm 
rather  than  the  exception  with  the  type  of  time  series  models  we  consider.  Next, 
we  will  look  at  an  alternative,  indirect  method  for  fitting  our  models.  In  principle, 
though,  the  methods  just  described  are  also  applicable  to  GLMMs,  through  the 
close  connections  of  GLMMs  and  state  space  models  described  above. 
2.3     The  Monte  Carlo  EM  Algorithm 

In  Section  2.2  we  presented  the  EM-algorithm  as  an  iterative  procedure 
consisting of two components, the E- and the M-step. The E-step calculates a
conditional  expectation  while  the  M-step  subsequently  maximizes  this  expectation. 
Often,  at  least  one  of  these  steps  is  analytically  intractable  and  in  most  of  the 
applications  considered  here,  both  steps  are.  Numerical  methods  (analytic  and 
stochastic)  have  to  be  used  to  overcome  these  difficulties,  whereby  the  E-step 
usually  is  the  more  troublesome.  One  popular  way  of  approximating  the  expected 
value  in  the  E-step  uses  Monte  Carlo  methods  and  is  discussed  in  Wei  and  Tanner 
(1990), McCulloch (1994, 1997) and Booth and Hobert (1999). The Monte Carlo
EM  (MCEM)  algorithm  uses  a  sample  from  the  distribution  of  the  random  effects 
u  given  the  observed  data  y  to  approximate  the  Q-function  in  (2.10).  In  particular, 
at iteration $k$, let $u^{(1)}, \ldots, u^{(m)}$ be a sample from this distribution, denoted by $h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)})$ and evaluated at the parameter estimates $\beta^{(k-1)}$ and $\psi^{(k-1)}$ from the previous iteration. The approximation to (2.10) is then given by

$$Q_m(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)}) = \frac{1}{m} \sum_{j=1}^{m} \log f(y \mid u^{(j)}; \beta) + \frac{1}{m} \sum_{j=1}^{m} \log g(u^{(j)}; \psi). \tag{2.15}$$

As $m \to \infty$, with probability one, $Q_m \to Q$. The M-step then maximizes $Q_m$ instead of $Q$ with respect to $\beta$ and $\psi$ and the resulting estimates $\beta^{(k)}$ and $\psi^{(k)}$ are used in the next iteration to generate a new sample from $h(u \mid y; \beta^{(k)}, \psi^{(k)})$. If maximization is not possible in closed form, sometimes only a pair of values $(\beta, \psi)$ which satisfies $Q_m(\beta, \psi \mid \beta^{(k-1)}, \psi^{(k-1)}) \ge Q_m(\beta^{(k-1)}, \psi^{(k-1)} \mid \beta^{(k-1)}, \psi^{(k-1)})$, but which does not attain the global maximum, is chosen as the new parameter update $(\beta^{(k)}, \psi^{(k)})$. However, we show for our models that the global maximum can be approximated in very few steps.

Maximization of $Q_m$ with respect to $\beta$ and $\psi$ is equivalent to maximizing the first term in (2.15) with respect to $\beta$ only and the second term with respect to $\psi$ only. This is due to the two-stage hierarchy of the response distribution and the random effects distribution in GLMMs and is discussed next. Different approaches to obtaining a sample from $h(u \mid y; \beta, \psi)$ for the approximation of the E-step are presented in Section 2.3.2 and convergence criteria are discussed in Section 2.3.3.
2.3.1     Maximization of $Q_m$

For now we assume we have available a sample $u^{(1)}, \ldots, u^{(m)}$ from $h(u \mid y; \beta, \psi)$ or an importance sampling distribution, generated by one of the mechanisms described in Sections 2.3.2 to 2.3.5. Let $Q_m^1$ and $Q_m^2$ be the first and second term of the sum in (2.15). Using the exponential family expression for the densities $f(y_{it} \mid u_i)$, at iteration $k$,

$$Q_m^1(\beta \mid \beta^{(k-1)}) \propto \frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{n} \sum_{t=1}^{n_i} \left[y_{it}\theta_{it}^{(j)} - b(\theta_{it}^{(j)})\right], \tag{2.16}$$

where, according to the GLMM specifications,

$$b'(\theta_{it}^{(j)}) = \mu_{it}^{(j)} = h^{-1}(x_{it}'\beta + z_{it}'u_i^{(j)}),$$

with $u_i^{(j)}$ the $i$-th component of the $j$-th sampled random effects vector $u^{(j)}$.

Maximizing $Q_m^1$ with respect to $\beta$ is equivalent to fitting an augmented GLM with known offsets: For $j = 1, \ldots, m$, let $y_{it}^{(j)} = y_{it}$ and $x_{it}^{(j)} = x_{it}$ be the random components and known design vectors for this augmented GLM, and let $z_{it}'u_i^{(j)}$ be a known offset associated with each $y_{it}^{(j)}$. That is, we duplicate the original data set $m$ times and attach a known offset $z_{it}'u_i^{(j)}$ to each replicated observation. The model for the mean in the augmented GLM, $E[Y_{it}^{(j)}] = \mu_{it}^{(j)} = h^{-1}(x_{it}^{(j)\prime}\beta + z_{it}'u_i^{(j)})$, is structurally equivalent to the model for the mean in the GLMM. Then, the log-likelihood for estimating $\beta$ in the augmented GLM is proportional to $Q_m^1$. Hence, maximization of $Q_m^1$ with respect to $\beta$ follows along the lines of well known, iterative Newton-Raphson or Fisher scoring algorithms for GLMs. Denote by $\beta^{(k)}$ the parameter vector after convergence of one of these algorithms. It represents the value of the maximum likelihood estimator of $\beta$ at iteration $k$ of the MCEM algorithm.
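
The augmented-GLM idea can be sketched in a few lines of Python for the simplest possible case. The code below is an illustration under assumptions that are not taken from the text: a Poisson response with log link, an intercept as the only fixed effect, and one sampled random intercept per observation acting as the offset. The function name and its arguments are hypothetical.

```python
import math


def fit_intercept_with_offsets(y, u_samples, beta0=0.0, tol=1e-10, max_iter=100):
    """Newton-Raphson for a Poisson GLM, log(mu) = beta + offset, on augmented data.

    y          : list of observed counts y_t
    u_samples  : list of m sampled random effect vectors, u_samples[j][t] is the
                 offset attached to the j-th replicate of y_t
    """
    # Augmented data set: every observation is replicated m times,
    # each copy paired with the offset from one sampled vector.
    augmented = [(yt, u[t]) for u in u_samples for t, yt in enumerate(y)]
    beta = beta0
    for _ in range(max_iter):
        mu = [math.exp(beta + offset) for _, offset in augmented]
        score = sum(yt - mu_i for (yt, _), mu_i in zip(augmented, mu))
        information = sum(mu)          # Fisher information for the intercept
        step = score / information
        beta += step
        if abs(step) < tol:
            break
    return beta
```

For this intercept-only model the maximizer is also available in closed form, beta = log(m * sum(y) / sum(exp(offsets))), which gives a convenient check on the iteration. With covariates, the same scheme applies with a vector-valued score and information matrix.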

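The expression for $Q_m^2$ depends on the assumed random effects distribution. Most generally, let $\Sigma$ be an unstructured $nq \times nq$ covariance matrix for the random effects vector $u = (u_1', \ldots, u_n')'$, where $q = \sum_{i=1}^{n} q_i$ and $q_i$ is the dimension of each cluster specific random effect $u_i$. Then, assuming $u$ has a mean zero multivariate normal distribution $g(u; \psi)$ where $\psi$ holds the $\frac{1}{2}nq(nq + 1)$ distinct elements of $\Sigma$, $Q_m^2$ has, up to an additive constant, the form

$$Q_m^2(\psi \mid \psi^{(k-1)}) = -\frac{1}{2} \log |\Sigma| - \frac{1}{2m} \sum_{j=1}^{m} u^{(j)\prime} \Sigma^{-1} u^{(j)}.$$
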

The goal is to maximize $Q_m^2$ with respect to the variance components $\psi$ of $\Sigma$. For a general $\Sigma$, the maximum is obtained at the variance components of the sample covariance matrix $S_m = \frac{1}{m} \sum_{j=1}^{m} u^{(j)} u^{(j)\prime}$. Denoting these by $\psi^{(k)}$ gives the value of the maximum likelihood estimator of $\psi$ at iteration $k$ of the MCEM algorithm.

The simplest structure occurs when random effects $u_i$ have independent components and are i.i.d. across all clusters, where $g(u; \psi)$ is then the product of $n$ $N(0, \sigma^2 I)$ densities and $\psi = \sigma$. $Q_m^2$ at iteration $k$ is then maximized at

$$\sigma^{(k)} = \left(\frac{1}{mnq} \sum_{j=1}^{m} u^{(j)\prime} u^{(j)}\right)^{1/2}.$$

Many applications of GLMMs use this simple structure of i.i.d. random effects, where often $u_i$ is a univariate random intercept. In this case, the estimate of $\sigma$ at iteration $k$ reduces to $\sigma^{(k)} = \left(\frac{1}{mn} \sum_{j=1}^{m} \sum_{i=1}^{n} (u_i^{(j)})^2\right)^{1/2}$. In Chapter 3 we will drop the assumption of independence and look at correlated random effects, but with more parsimonious covariance structures than the most general case presented here. Maximization of $Q_m^2$ with respect to $\psi$ will be presented there on a case by case basis.
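
For the random intercept case, the M-step update for the standard deviation is a one-line computation. The short Python sketch below assumes the sampled random effects are stored as a list of m vectors of length n; the function name and this storage layout are illustrative choices, not taken from the text.

```python
import math


def sigma_update(u_samples):
    """Root mean square of all sampled univariate random intercepts.

    u_samples : list of m sampled vectors, each holding n random intercepts.
    """
    m = len(u_samples)
    n = len(u_samples[0])
    total = sum(ui * ui for u in u_samples for ui in u)
    return math.sqrt(total / (m * n))
```

The update uses the known mean of zero rather than the sample mean, which is why no centering appears in the sum.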
2.3.2     Generating Samples from $h(u \mid y; \beta, \psi)$

So far we assumed we had available a sample $u^{(1)}, \ldots, u^{(m)}$ to approximate the expected value in the E-step of the MCEM algorithm. This section describes how to generate such a sample from $h(u \mid y; \beta, \psi)$, which is only known up to a normalizing constant, or from an importance density $g(u)$. In the following, we will suppress the dependency on parameters $\beta$ and $\psi$, since the densities are always evaluated at their current values. Three methods are presented: The accept-reject
algorithm  produces  independent  samples,  while  Metropolis-Hastings  algorithms 
produce  dependent  samples.  A  detailed  description  of  all  three  methods  can  be 
found  in  Robert  and  Casella  (1999). 
2.3.2.1     Accept-reject  sampling  in  GLMMs 

In general, for accept-reject sampling we need to find a candidate density $g$ and a constant $M$, such that for the density of interest $h$ (the target density) $h(x) \le M g(x)$ holds for all $x$ in the support of $h$. The algorithm is then to
1.   generate $x \sim g$, $w \sim \text{Uniform}[0, 1]$;
2.   accept $x$ as a random sample from $h$ if $w \le \frac{h(x)}{M g(x)}$;
3.   return to 1. otherwise.
This will produce one random sample $x$ from the target density $h$. The probability of acceptance is given by $1/M$ and the expected number of trials until a variable is accepted is $M$.
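
The three steps above can be sketched generically in Python. The function below is a minimal illustration, not a tuned implementation; its name, its argument list and the convention of returning the number of trials are assumptions made for this example.

```python
import random


def accept_reject(target_pdf, sample_candidate, candidate_pdf, M, rng):
    """Draw one value from target_pdf using candidate draws and envelope constant M.

    Requires target_pdf(x) <= M * candidate_pdf(x) for all x.
    Returns the accepted value and the number of trials it took.
    """
    trials = 0
    while True:
        trials += 1
        x = sample_candidate(rng)          # step 1: candidate draw
        w = rng.random()                   # step 1: uniform draw
        if w <= target_pdf(x) / (M * candidate_pdf(x)):
            return x, trials               # step 2: accept
        # step 3: otherwise try again
```

As a usage example, the density 6x(1 - x) on [0, 1] is bounded by 1.5 times the uniform density, so calling the function with `target_pdf=lambda x: 6 * x * (1 - x)`, `sample_candidate=lambda rng: rng.random()`, `candidate_pdf=lambda x: 1.0` and `M=1.5` should need about 1.5 trials per accepted draw on average.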


For our purpose, the target density is $h(u \mid y)$. Since $h(u \mid y) = \frac{1}{a} f(y \mid u) g(u) \le M g(u)$, where $M = \frac{1}{a} \sup_u f(y \mid u)$ and $a$ is an unknown normalizing constant equal to the marginal likelihood, the multivariate normal random effects distribution $g(u)$ can be used as a candidate density. Booth and Hobert (1999, Sect. 4.1) show that for certain models $\sup_u f(y \mid u)$ can be easily calculated from the data alone and thus need not be updated at every iteration. For some models we discuss here, the condition of Booth and Hobert (1999, Sect. 4.1, page 272) required for this simplification does not hold. However, the likelihood of a saturated GLM is always an upper bound for $f(y \mid u)$. To illustrate, regard $L(u) = f(y \mid u)$ as the likelihood corresponding to a GLM with random components $y_{it}$ and linear predictor $\eta_{it} = z_{it}'u_i + x_{it}'\beta$, where now $x_{it}'\beta$ plays the role of a known offset and $u_i$ are the parameters of interest. The maximized likelihood $L(\hat{u})$ for this model is never larger than the maximized likelihood $L(y)$ for a saturated model. Hence, $\sup_u f(y \mid u) \le L(\hat{u}) \le L(y)$, and $L(\hat{u})$ or $L(y)$ can be used to construct $M$.

Example: In Section 3.1 we consider a data set where, conditional on a random
effect $u_t$, $Y_{it}$, the $t$-th observation in group $i$, is modeled as a
Binomial($n_{it}, \pi_{it}$) random variable. There are 16 time points, i.e.,
$t = 1, \ldots, 16$, and two groups, $i = 1, 2$. A very simple logistic-normal GLMM
for these data has form $\text{logit}(\pi_{it}(u_t)) = \alpha + \beta x_i + u_t$,
where $x_i$ is a binary group indicator. The overall design matrix for
this problem is the $32 \times 18$ matrix
whose columns hold the coefficients corresponding to $\alpha, \beta, u_1, u_2, \ldots, u_{16}$. All
rows of this matrix are different, and as a consequence the condition of Booth and
Hobert (1999, Sect. 4.1, page 272) does not hold. However, the saturated binomial
likelihood $L(y)$ is an upper bound for $f(y \mid u)$, i.e.,

$$\sup_u f(y \mid u) \le L(y).$$

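For instance, with the logistic-normal example from above with linear predictor
$\eta_{it} = c + u_t$, where $c = \alpha + \beta x_i$ represents the fixed part of the model, we have

$$f(y_{it} \mid u_t) = \binom{n_{it}}{y_{it}} \left( \frac{e^{c+u_t}}{1+e^{c+u_t}} \right)^{y_{it}} \left( \frac{1}{1+e^{c+u_t}} \right)^{n_{it}-y_{it}}.$$
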
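By first taking logs and then finding first and second derivatives with respect to $u_t$,
we see that $u_t^* = \log\left( \frac{y_{it}/n_{it}}{1 - y_{it}/n_{it}} \right) - c$ maximizes this expression for $0 < y_{it} < n_{it}$.
Plugging in, we obtain the result

$$\sup_{u_t} f(y_{it} \mid u_t) = \binom{n_{it}}{y_{it}} \left( \frac{y_{it}}{n_{it}} \right)^{y_{it}} \left( 1 - \frac{y_{it}}{n_{it}} \right)^{n_{it}-y_{it}}.$$

For the special cases of $y_{it} = 0$ or $y_{it} = n_{it}$ the trivial bound on $f(y_{it} \mid u_t)$ is 1.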
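Hence, the following inequality, which immediately follows from above, can be used
in constructing the accept-reject algorithm for a logistic-normal model with linear
predictor of form $\eta_{it} = c + u_t$ (with the convention $0^0 = 1$):

$$f(y \mid u) = \prod_{i,t} f(y_{it} \mid u_t) \le \prod_{i,t} \binom{n_{it}}{y_{it}} \left( \frac{y_{it}}{n_{it}} \right)^{y_{it}} \left( 1 - \frac{y_{it}}{n_{it}} \right)^{n_{it}-y_{it}} = L(y).$$
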


This means we can select $M = \frac{1}{a} L(y)$ to meet the accept-reject condition and
consequently we accept a sample $u$ from $g(u)$ if, for a $w \sim \text{Uniform}[0, 1]$,

$$w \le \frac{h(u \mid y)}{M g(u)} = \frac{f(y \mid u)}{L(y)}.$$

Notice that this condition is free of the normalizing constant $a$. In practice,
especially for high dimensional random effects, $M$ can be very large and therefore
we almost never accept a sample. Two alternative methods described below
may avoid this problem. Note, however, that the accept-reject method yields an
independent and identically distributed sample from the target distribution. This is
important if one wants to implement an automated MCEM algorithm (Booth and
Hobert, 1999), where the Monte Carlo sample size $m$ is increased automatically as
the algorithm progresses to adjust for the error in the Monte Carlo approximation
to the E-step.
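
The accept-reject rule above can be sketched in code as follows. This is only a
minimal illustration under stated assumptions, not the implementation used in
this dissertation: it considers a single time point with two binomial
observations sharing one random effect $u_t \sim N(0, \sigma^2)$, the counts, fixed
parts $c_i$ and $\sigma$ are made-up values, and all function names are hypothetical.
Since the binomial coefficients cancel in $f(y \mid u)/L(y)$, only the kernels are
compared, on the log scale.

```python
import math
import random

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def log_kernel(y, n, p):
    """log of p^y (1-p)^(n-y), using the convention 0 * log(0) = 0."""
    out = 0.0
    if y > 0:
        out += y * math.log(p)
    if n - y > 0:
        out += (n - y) * math.log(1.0 - p)
    return out

def log_ratio(y, n, c, u):
    """log f(y | u) - log L(y); the binomial coefficients cancel."""
    total = 0.0
    for y_i, n_i, c_i in zip(y, n, c):
        total += log_kernel(y_i, n_i, expit(c_i + u))  # conditional model
        total -= log_kernel(y_i, n_i, y_i / n_i)       # saturated model
    return total

def accept_reject(y, n, c, sigma, rng, max_tries=100000):
    """Draw one u from h(u | y) using the N(0, sigma^2) candidate g(u)."""
    for tries in range(1, max_tries + 1):
        u = rng.gauss(0.0, sigma)                    # step 1: candidate from g
        w = 1.0 - rng.random()                       # uniform on (0, 1]
        if math.log(w) <= log_ratio(y, n, c, u):     # step 2: accept
            return u, tries
    raise RuntimeError("no candidate accepted")      # step 3 gave up

# Hypothetical data for one year: counts, totals, fixed parts and sigma.
y, n, c, sigma = [30, 12], [100, 60], [-0.8, -1.2], 0.5
rng = random.Random(1)
draws = [accept_reject(y, n, c, sigma, rng)[0] for _ in range(200)]
```

The accepted values in `draws` are independent draws from the target, at the
cost of discarded candidates; the second element returned by `accept_reject`
counts how many candidates were needed.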
2.3.2.2     Markov  chain  Monte  Carlo  methods 

For high dimensional distributions $h(u \mid y)$, which are unavoidable if correlated
random effects are used, accept-reject methods can be very slow. An alternative
is to generate a Markov chain with invariant distribution $h(u \mid y)$, which may
be much faster but results in dependent samples. McCulloch (1997) discussed a
Metropolis-Hastings algorithm for creating such a chain for the logistic-normal
regression case. In general, an independent Metropolis-Hastings algorithm is built
as follows: Choose a candidate density $g(u)$ with the same support as $h(u \mid y)$.
Then, for a current state $u^{(j-1)}$,

1.  Generate $w \sim g$;

2.  Set $u^{(j)}$ equal to $w$ with probability $p = \min\left(1, \frac{h(w \mid y)\, g(u^{(j-1)})}{h(u^{(j-1)} \mid y)\, g(w)}\right)$
and equal to $u^{(j-1)}$ with probability $1 - p$.

After a sufficient burn-in time, the states of the generated chain can be
regarded as a (dependent) sample from $h(u \mid y)$. If the candidate density
is chosen to be the density of the random effects $g(u)$, the acceptance probability
in step 2 reduces to the simple form $\min\left(1, f(y \mid w)/f(y \mid u^{(j-1)})\right)$. To further
speed up simulations, McCulloch (1997) uses a random scan algorithm which only
updates the $k$-th component of the previous state $u^{(j-1)}$ and, upon acceptance in
step 2, uses it as the new state.
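
As a hedged illustration of the two steps above, the sketch below runs a
Metropolis-Hastings chain whose candidate density is the random effects density
itself, so that the acceptance probability is the likelihood ratio
$f(y \mid w)/f(y \mid u^{(j-1)})$. It assumes a scalar random effect shared by two
binomial observations; the data values, the burn-in length and the function
names are hypothetical choices made for this example only.

```python
import math
import random

def log_f(y, n, c, u):
    """log f(y | u) up to an additive constant (binomial, logit link)."""
    total = 0.0
    for y_i, n_i, c_i in zip(y, n, c):
        eta = c_i + u
        total += y_i * eta - n_i * math.log1p(math.exp(eta))
    return total

def independence_mh(y, n, c, sigma, n_iter, rng, u0=0.0):
    """Chain whose candidate density is the N(0, sigma^2) random effects density."""
    u, chain, accepted = u0, [], 0
    for _ in range(n_iter):
        w = rng.gauss(0.0, sigma)                        # step 1: w ~ g
        log_p = min(0.0, log_f(y, n, c, w) - log_f(y, n, c, u))
        if math.log(1.0 - rng.random()) <= log_p:        # step 2: move to w
            u, accepted = w, accepted + 1
        chain.append(u)                                  # otherwise keep old state
    return chain, accepted / n_iter

y, n, c, sigma = [30, 12], [100, 60], [-0.8, -1.2], 0.5  # made-up values
chain, rate = independence_mh(y, n, c, sigma, 5000, random.Random(2))
kept = chain[500:]                                       # drop a burn-in period
post_mean = sum(kept) / len(kept)
```

Unlike accept-reject sampling, every iteration yields a state, but rejected
candidates repeat the previous value, which is the source of the dependence in
the resulting sample.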

Another popular MCMC algorithm is the Gibbs sampler. Let
$u^{(j-1)} = (u_1^{(j-1)}, \ldots, u_n^{(j-1)})$ denote the current state of a Markov chain with invariant
distribution $h(u \mid y)$. One iteration of the Gibbs sampler generates, componentwise,

$$\begin{aligned}
u_1^{(j)} &\sim h(u_1 \mid u_2^{(j-1)}, u_3^{(j-1)}, \ldots, u_n^{(j-1)}, y) \\
u_2^{(j)} &\sim h(u_2 \mid u_1^{(j)}, u_3^{(j-1)}, \ldots, u_n^{(j-1)}, y) \\
&\;\;\vdots \\
u_n^{(j)} &\sim h(u_n \mid u_1^{(j)}, \ldots, u_{n-1}^{(j)}, y),
\end{aligned}$$

where $h(u_i \mid u_1^{(j)}, \ldots, u_{i-1}^{(j)}, u_{i+1}^{(j-1)}, \ldots, u_n^{(j-1)}, y)$ are the so-called full conditionals of
$h(u \mid y)$. The vector $u^{(j)} = (u_1^{(j)}, \ldots, u_n^{(j)})$ represents the new state of the chain
and, after a sufficient burn-in time, can be regarded as a sample from $h(u \mid y)$.
The advantage of the Gibbs sampler is that it reduces sampling of a possibly very
high-dimensional vector $u$ into sampling of several lower-dimensional components
of $u$. We will use the Gibbs sampler in connection with autoregressive random
effects to simplify sampling from an initially very high-dimensional distribution
$h(u \mid y)$ by sampling from its simpler full univariate conditionals.
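
The componentwise updating scheme can be illustrated on a toy target for which
the full conditionals are known exactly. The sketch below is not the sampler
developed later for GLMMs: it assumes a standard bivariate normal target with
correlation `r`, whose full conditionals are $u_1 \mid u_2 \sim N(r u_2, 1 - r^2)$ and
$u_2 \mid u_1 \sim N(r u_1, 1 - r^2)$. The function name, burn-in length and seed
are hypothetical.

```python
import math
import random

def gibbs_bivariate_normal(r, n_iter, rng, start=(0.0, 0.0)):
    """Gibbs sampler for a standard bivariate normal with correlation r."""
    u1, u2 = start
    sd = math.sqrt(1.0 - r * r)          # conditional standard deviation
    states = []
    for _ in range(n_iter):
        u1 = rng.gauss(r * u2, sd)       # uses u2 from the previous iteration
        u2 = rng.gauss(r * u1, sd)       # uses the freshly updated u1
        states.append((u1, u2))
    return states

# Discard the first 1000 states as burn-in, then check the sample correlation.
states = gibbs_bivariate_normal(0.6, 20000, random.Random(3))[1000:]
m1 = sum(s[0] for s in states) / len(states)
m2 = sum(s[1] for s in states) / len(states)
cov = sum((s[0] - m1) * (s[1] - m2) for s in states) / len(states)
v1 = sum((s[0] - m1) ** 2 for s in states) / len(states)
v2 = sum((s[1] - m2) ** 2 for s in states) / len(states)
sample_corr = cov / math.sqrt(v1 * v2)
```

Each update conditions on the most recent values of all other components,
exactly as in the display above; only univariate draws are ever required.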
2.3.2.3     Importance sampling

An importance sampling approximation to the Q-function in (2.10) is given by

$$Q_m(\beta, \psi) = \frac{1}{m} \sum_{j=1}^{m} w_j \log f(y, u^{(j)}; \beta, \psi),$$

where the $u^{(j)}$ are independent samples from an importance density $g(u; \psi^{(k-1)})$ and

$$w_j = \frac{h(u^{(j)} \mid y; \beta^{(k-1)}, \psi^{(k-1)})}{g(u^{(j)}; \psi^{(k-1)})}$$

are importance weights at iteration $k$. Usually, $Q_m$ is divided by the sum of the
importance weights $\sum_{j=1}^{m} w_j$. The normalizing constant $a$ only depends on known
parameters $(\beta^{(k-1)}, \psi^{(k-1)})$ and hence plays no part in the following maximization
step. Selecting the importance density $g$ is a delicate issue. It should be easy
to simulate from but also resemble $h(u \mid y)$ as closely as possible. Booth and
Hobert (1999) suggest a Student $t$ density as the importance distribution $g$, whose
mean and variance match those of $h(u \mid y)$, which are derived via a Laplace
approximation.
2.3.3     Convergence  Criteria 

Due to the stochastic nature of the algorithm, parameter estimates of two
successive iterations can be close together just by chance, although convergence is not
yet achieved. To reduce the risk of stopping prematurely, we declare convergence
if the relative change in parameter estimates is less than some $\epsilon_1$ for $c$ (e.g., five)
consecutive times. Let $\lambda^{(k)} = (\beta^{(k)\prime}, \psi^{(k)\prime})'$ be the vector of unknown fixed effects
parameters and variance components. Then this condition means that

$$\max_i \frac{|\lambda_i^{(k)} - \lambda_i^{(k-1)}|}{|\lambda_i^{(k-1)}|} < \epsilon_1 \qquad (2.17)$$

has to be fulfilled for $c$ (e.g., five) consecutive values of $k$. For any $\lambda_i^{(k)}$, an exception to this
rule occurs when the estimated standard error of that parameter is substantially
larger than the change from one iteration to the next. Hence, at iteration $k$, for
those parameters satisfying

$$\max_i \frac{|\lambda_i^{(k)} - \lambda_i^{(k-1)}|}{\sqrt{\text{var}(\lambda_i^{(k)})}} < \epsilon_2,$$

where $\text{var}(\lambda_i^{(k)})$ is the current estimate of the variance of the MLE $\hat{\lambda}_i$, the relative
precision of criterion (2.17) need not be met. An estimate of this variance can
be obtained from the observed information matrix of the ML estimator for $\lambda$.
Louis (1982) showed that the observed information matrix can be written in
terms of the first ($l'$) and second ($l''$) derivatives of the complete data log-likelihood
$l(\lambda; y, u) = \log f(y, u; \lambda)$. Evaluated at the MLE $\hat{\lambda}$, it is given by

$$I(\hat{\lambda}) = -E_{u \mid y}\left[ l''(\hat{\lambda}; y, u) \mid y \right] - \text{var}_{u \mid y}\left[ l'(\hat{\lambda}; y, u) \mid y \right].$$

An approximation to this matrix, at iteration $k$, uses Monte Carlo sums with draws
from $h(u \mid y; \lambda^{(k)})$ from the current iteration of the MCEM algorithm.
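
The stopping rule built from criterion (2.17) and its standard-error exception
can be sketched as a small bookkeeping class. This is a hedged reading of the
rule, not the author's code: it assumes the exception is applied parameter by
parameter and that standard errors are supplied by the caller (for example from
a Monte Carlo approximation of the information matrix). The class and method
names are hypothetical.

```python
class ConvergenceMonitor:
    """Tracks criterion (2.17) together with the standard-error exception."""

    def __init__(self, eps1, eps2, c):
        self.eps1, self.eps2, self.c = eps1, eps2, c
        self.hits = 0   # consecutive iterations on which the rule held

    def parameter_ok(self, new, old, se):
        change = abs(new - old)
        relative_ok = change < self.eps1 * abs(old)   # relative change, (2.17)
        se_ok = change < self.eps2 * se               # small relative to s.e.
        return relative_ok or se_ok

    def update(self, lam_new, lam_old, se):
        """Return True once the rule has held for c consecutive iterations."""
        ok = all(self.parameter_ok(a, b, s)
                 for a, b, s in zip(lam_new, lam_old, se))
        self.hits = self.hits + 1 if ok else 0
        return self.hits >= self.c

# Made-up parameter paths: the second parameter only passes via the s.e. rule.
monitor = ConvergenceMonitor(eps1=0.001, eps2=0.003, c=2)
first = monitor.update([1.0005, 0.0102], [1.0, 0.0100], [0.1, 1.0])
second = monitor.update([1.0006, 0.0103], [1.0005, 0.0102], [0.1, 1.0])
reset = monitor.update([1.2, 0.0103], [1.0006, 0.0103], [0.1, 1.0])
```

Writing the relative-change test as `change < eps1 * abs(old)` avoids a division
by zero when a parameter estimate is exactly zero; a single large jump resets
the counter of consecutive hits.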

To further safeguard against stopping prematurely, we use a third convergence
criterion based on the $Q_m$ function. For deterministic EM, the Q function is
guaranteed to increase from iteration to iteration. With MCEM, because of
the stochastic nature of the approximation, $Q_m^{(k)}$ can be less than $Q_m^{(k-1)}$ because of an
"unlucky" Monte Carlo sample at iteration $k$. Hence, the parameter estimates
obtained from maximizing $Q_m^{(k)}$ can be a step in the wrong direction and actually
decrease the value of the likelihood. To counter this, we declare convergence only
if successive values of $Q_m^{(k)}$ are within a small neighborhood. More importantly,
however, we accept the $k$-th parameter update $\lambda^{(k)}$ only if the relative
change in the $Q_m$ function is larger than some small negative constant, e.g.,

$$\frac{Q_m^{(k)} - Q_m^{(k-1)}}{|Q_m^{(k-1)}|} > \epsilon_3. \qquad (2.18)$$


If at iteration $k$ (2.18) is not met and there is reason to believe that $\lambda^{(k)}$ decreases
the likelihood and is worse than the parameter update from the previous iteration,
we repeat the $k$-th iteration with a new and larger Monte Carlo sample. Thereby,
we hope to better approximate the Q-function and as a result get a better estimate
$\lambda^{(k)}$, with a $Q_m$-function larger than the previous one. If this does not happen, we
nevertheless accept $\lambda^{(k)}$ and proceed to the next iteration, possibly letting the
algorithm temporarily move in the direction of a lower likelihood region. Otherwise,
the Monte Carlo sample size would quickly grow without bound at an early stage
of the algorithm. Furthermore, at early stages, the Monte Carlo error in the
approximation of the Q function can be large and hence its trace plot is very
volatile.

Caffo, Jank and Jones (2003) go a step further and calculate asymptotic
confidence intervals for the change in the $Q_m$-function, based on which they
construct a rule for accepting or rejecting $\lambda^{(k)}$. They discuss schemes of how to
increase the Monte Carlo sample accordingly, and their MCEM algorithm inherits
the ascent property of EM with high probability. However, we feel that the simpler
criterion (2.18) suffices for the examples considered here.

Coupled with any convergence criterion is the question of the updating scheme
for the Monte Carlo sample size $m$ between iterations. In general, we will use
$m^{(k)} = a\, m^{(k-1)}$, where $a > 1$ and $m^{(k)}$ is the Monte Carlo sample size at iteration
$k$. At early iterations, $m^{(k)}$ will be low, since big parameter jumps are expected
regardless of the quality of the approximation and the Monte Carlo error associated
with it. Later, as more weight will be put on decreasing the Monte Carlo error in
the approximations, the polynomial increase guarantees sufficiently large Monte
Carlo samples. Furthermore, condition (2.18) signals when an additional boost
in $m^{(k)}$ is needed to better approximate the Q-function in this iteration. Hence,
whenever (2.18) is not met, we re-run iteration $k$ with a bigger sample size $q\, m^{(k)}$,
where $q > 1$ is usually between 1 and 2.
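
The sample size schedule can be sketched as follows. This is an illustration
under assumptions rather than the exact scheme used here: the relative change
in the Q-function is computed with $|Q_m^{(k-1)}|$ in the denominator, sample
sizes are rounded up to integers, the sequence of Q-values is invented, and a
failed check only triggers the boost (the actual re-run of iteration $k$ is not
simulated). The function names are hypothetical.

```python
import math

def q_change_ok(q_new, q_old, eps3):
    """Relative change in the Monte Carlo Q-function, in the spirit of (2.18)."""
    return (q_new - q_old) / abs(q_old) > eps3

def next_sample_size(m, a):
    """Regular update m^(k) = a * m^(k-1), rounded up to an integer."""
    return math.ceil(a * m)

def boosted_sample_size(m, q):
    """Extra boost q * m^(k), used when the check fails at iteration k."""
    return math.ceil(q * m)

# Hypothetical trace of Q-values; a, q and eps3 are example settings.
a, q, eps3 = 1.03, 1.05, -0.005
m, q_old = 50, -120.0
history = []
for q_new in [-115.0, -112.0, -113.5, -111.0]:
    m = next_sample_size(m, a)
    if not q_change_ok(q_new, q_old, eps3):
        m = boosted_sample_size(m, q)   # iteration would be re-run with this m
    history.append(m)
    q_old = q_new
```

In this made-up trace only the third iteration violates the check, so the
sample size grows by the regular factor `a` everywhere except for one
additional jump by the factor `q`.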
Figure 2-1: Plot of the typical behavior of the Monte Carlo sample size $m^{(k)}$ and
the Q-function $Q_m^{(k)}$ through MCEM iterations.

The iteration number is shown on the x-axis. Plots are based on the data and
model for the boat race data discussed in Chapter 5.

A typical picture of the Monte Carlo sample size $m^{(k)}$ and the $Q_m^{(k)}$ function
through the iterations of an MCEM algorithm is presented in Figure 2-1. The
increase in the $Q_m^{(k)}$ function is large at the first iterations, but its Monte Carlo
error is also large due to the small Monte Carlo sample size. The plot of the Monte
Carlo sample size $m^{(k)}$ shows several jumps corresponding to the events that the
$Q_m$ function actually decreased by more than $|\epsilon_3|$ from one iteration to the next and
we adjusted with an additional boost in generated samples. The data and model
on which this plot is based are taken from the boat race example analyzed
and discussed in Chapter 5, with convergence criteria set to $\epsilon_1 = 0.001$, $c = 4$,
$\epsilon_2 = 0.003$, $\epsilon_3 = -0.005$, $a = 1.03$ and $q = 1.05$.

Fort  and  Moulines  (2003)  show  that  with  geometrically  ergodic  (see,  e.g., 
Robert  and  Casella,  1999)  MCMC  samplers,  a  polynomial  increase  in  the  Monte 
Carlo sample size leads to convergence of MCEM parameter estimates. However,
establishing  geometric  ergodicity  is  not  an  easy  task.  Other,  more  sophisticated 
and  automated  Monte  Carlo  sample  size  updating  schemes  are  presented  by  Booth 
and  Hobert  (1999)  for  independent  sampling,  and  Caffo,  Jank  and  Jones  (2003),  for 
independent  and  MCMC  sampling. 


CHAPTER  3 
CORRELATED  RANDOM  EFFECTS 

In Chapter 2 we mentioned on several occasions that for certain data structures
the  usual  assumption  of  independent  random  effects  is  inappropriate.  For  instance, 
if  clusters  represent  time  points  in  a  study  over  time,  observations  from  different 
clusters  can  no  longer  be  assumed  (marginally)  independent.  Or,  in  longitudinal 
studies,  the  non-negative  and  exchangeable  correlation  structure  among  repeated 
observations  implied  by  a  single  random  effect  can  be  far  from  the  truth  for  long 
sequences  of  repeated  observations.  Section  3.1  presents  data  from  a  cross-sectional 
time  series  which  motivates  the  use  of  correlated  random  effects  and  discusses  their 
implications.  In  Sections  3.2  and  3.3  two  special  correlation  structures  useful  for 
modeling  the  dependence  structure  in  discrete  repeated  measures  with  possibly 
unequally  spaced  observation  times  are  discussed.  The  main  focus  of  this  chapter 
is  on  the  technical  implications  on  the  MCEM  algorithm  arising  from  estimating 
an  additional  variance  (correlation)  component.  In  contrast  to  models  with 
independent  random  effects,  the  M-step  has  no  closed  form  solution  and  iterative 
methods  have  to  be  used  to  find  the  maximum.  Also,  because  random  effects  are 
correlated  a  priori,  they  are  correlated  a  posteriori,  and  sampling  from  the  posterior 
distribution  of  u  |  y  as  required  by  the  MCEM  algorithm  is  more  involved  than 
with  independent  random  effects.  A  Gibbs  sampling  approach  is  developed  in 
Section  3.4. 

From here on we let $t$ denote the index for the discrete observation times,
$t = 1, \ldots, T$, and we let $Y_{it}$ denote a response at time point $t$ for stratum $i$,
$i = 1, \ldots, n$. Throughout, we will assume univariate but correlated random effects
$\{u_t\}_{t=1}^{T}$ associated with the observations over time.


3.1      A  Motivating  Example:  Data  from  the  General  Social  Survey 

The basic purpose of the General Social Survey (GSS), conducted by the
National Opinion Research Center, is to gather data on contemporary American
society in order to monitor and explain trends and constants in attitudes,
behaviors and attributes. It is second only to the census in popularity
among sociologists as a data source for conducting research. The GSS
questionnaire contains a standard core of demographic and attitudinal variables
whose wording is retained throughout the years to facilitate time trend
studies. (Source: www.norc.uchicago.edu/projects/gensocl.asp). Currently, the
GSS comprises a total of 24 surveys conducted in the years 1973-1978, 1980,
1982, 1983-1994, 1996, 1998, 2000 and 2002, with data available online (at
www.webapp.icpsr.umich.edu/GSS/) through 1998. Two features, a discrete
response variable (most of the attitude questions) observed through time
and unequally spaced observation times, make it a prime resource for applying the
models proposed in this dissertation. Data obtained from the GSS are different
from longitudinal studies where subjects are followed through time. Here, responses
are from independent cross-sectional surveys of different subjects in each year.

One question included in 16 of the 22 surveys up to 1998 recorded attitude
towards homosexual relationships. It was observed in the years 1974, 1976-77,
1980, 1982, 1984-85, 1987-1991, 1993-94, 1996 and 1998. We will use these data
to motivate and illustrate the use of correlated random effects. Figure 3-1 shows
the proportion of respondents who agreed with the statement that homosexual
relationships are not wrong at all for the two race cohorts, white respondents and
black respondents. For simplicity in this introductory example, only race was
chosen as a cross-classifying variable and attitude was measured as answering "yes"
or "no" to the aforementioned question. Let $Y_{it}$ denote the number of people in
year $t$ and of race $i$ who agreed with the statement that homosexual relationships
are not wrong at all. The index $t = 1, \ldots, 16$ runs through the set of 16 years
$\{1974, 1976, 1977, 1980, \ldots, 1998\}$ mentioned above, and $i = 1$ for race equal to
white and $i = 2$ for race equal to black. The conditional independence assumption
discussed in Section 2.1 allows us to model $Y_{it}$, the sum of $n_{it}$ binary variables
which are the individual responses, as a binomial variable conditional on a yearly
random effect. That is, the probabilistic model we propose assumes a conditional
Binomial($n_{it}, \pi_{it}$) distribution for each member of the two time series
$\{Y_{1t}\}_{t=1}^{16}$ and $\{Y_{2t}\}_{t=1}^{16}$ pictured in Figure 3-1. The parameters $n_{it}$ and $\pi_{it}$ are,
respectively, the total number of respondents of race $i$ in year $t$ and their
conditional probability of agreeing with the statement that homosexual
relationships are not wrong at all.

Figure 3-1: Sampling proportions from the GSS data set.
Proportion of whites (squares) and blacks (circles) agreeing with the statement that
homosexual relationships are not wrong at all, from 1974 to 1998.
3.1.1     A GLMM Approach

A popular model for $\pi_{it}$ is a logistic-normal model for which the link function
$h(\cdot)$ in (2.2) is the logit link and the random effects structure simplifies to a
random intercept $u_t$. We will assume that the fixed parameter vector $\beta$ is composed
of an intercept term $\alpha$, linear and quadratic time effects $\beta_1$ and $\beta_2$, a race effect $\beta_3$
and a year-by-race interaction $\beta_4$. With $x_{1t}$ representing the year variable centered
around 1984 (e.g., $x_{11} = 1974 - 1984 = -10$) and $x_{2i}$ the indicator variable for race
(for whites $x_{21} = 0$, for blacks $x_{22} = 1$), the model has form

$$\text{logit}(\pi_{it}(u_t)) = \text{logit}(P(Y_{it} = y_{it} \mid u_t)) = \alpha + \beta_1 x_{1t} + \beta_2 x_{1t}^2 + \beta_3 x_{2i} + \beta_4 x_{1t} x_{2i} + u_t. \qquad (3.1)$$

Apart from the fixed effects, the random time effect $u_t$ captures the dependency
structure over the years. Note that $\pi_{it}(u_t)$ is a conditional probability, given the
random effect $u_t$ from the year the question was asked. This random effect $u_t$ can
be interpreted as the unmeasurable public opinion about homosexual relationships,
common to all respondents within the same year. By introducing this random
effect, we assume that individual opinions are influenced by this overall opinion
or the social and political climate on homosexual relationships (like awareness of
AIDS and the social spending associated with it, which is hard to measure). Thus,
individual responses within a given year are no longer independent of each other,
but share a common random effect. Furthermore, it is natural to assume that the
public opinion about homosexual relationships changes gradually over time, with
higher correlations for years closer together and lower correlations for years further
apart. It would be wrong and unnatural to assume that the public opinion (or
political climate) is independent from one year to the next. However, this would
be assumed by modeling the random effects $\{u_t\}$ as independent of each other. It
would also be wrong to assume a common, time-independent random effect $u = u_t$
for all time points $t$, as this implies that public opinion does not change over time.
Its effect would then be the same, whether responses are measured in 1974 or 1998.
3.1.2     Motivating Correlated Random Effects

To capture the dependency in public opinion and therefore in responses over
different years, we propose random effects that are correlated. In particular,
for this example with unequally spaced observation times, we suggest normal
autocorrelated random effects $\{u_t\}$ with variance function

$$\text{var}(u_t) = \sigma^2, \quad t = 1, \ldots, 16$$

and correlation function

$$\text{corr}(u_t, u_{t^*}) = \rho^{|x_{1t} - x_{1t^*}|}, \quad 1 \le t < t^* \le 16,$$

where $x_{1t} - x_{1t^*}$ is the difference between the two years identified by indices $t$ and
$t^*$. This is equivalent to specifying a latent autoregressive process
underlying the data generation mechanism. Both of these formulations naturally
handle the multiple gaps in the observed time series. There is no need to make
adjustments (such as imputation of data or artificially treating the series as equally
spaced) in our analysis due to "missing" data at years 1975, 1978-79, 1981, 1983,
1986, 1992, 1995 or 1997.
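
One way to see how such a latent process handles gaps is to simulate it. The
sketch below is an assumption-laden illustration, not the specification used in
the analysis: it uses the standard recursion for a stationary Gaussian
first-order autoregressive process, with the autoregressive coefficient raised
to the power of the gap in years, which reproduces the variance and correlation
functions stated above. The years are the 16 observation years of the example;
the values of `sigma` and `rho` are the point estimates quoted for these data
in this section, and the function names and seed are hypothetical.

```python
import math
import random

YEARS = [1974, 1976, 1977, 1980, 1982, 1984, 1985, 1987,
         1988, 1989, 1990, 1991, 1993, 1994, 1996, 1998]

def simulate_random_effects(years, sigma, rho, rng):
    """One path u_1, ..., u_T with variance sigma^2 and corr rho^|gap in years|."""
    u = [rng.gauss(0.0, sigma)]                   # stationary starting value
    for prev, cur in zip(years, years[1:]):
        phi = rho ** (cur - prev)                 # gap-adjusted AR coefficient
        innov_sd = sigma * math.sqrt(1.0 - phi * phi)
        u.append(phi * u[-1] + rng.gauss(0.0, innov_sd))
    return u

def implied_corr(years, rho, s, t):
    """Correlation function from the text for 0-based time indices s and t."""
    return rho ** abs(years[s] - years[t])

def sample_corr(paths, s, t):
    """Empirical correlation between components s and t across simulated paths."""
    xs, ys = [p[s] for p in paths], [p[t] for p in paths]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

sigma, rho = 0.10, 0.65
rng = random.Random(4)
paths = [simulate_random_effects(YEARS, sigma, rho, rng) for _ in range(20000)]
```

Comparing `sample_corr(paths, 0, 1)` with `implied_corr(YEARS, rho, 0, 1)` (a gap
of two years) gives a quick check that no imputation for the unobserved years
is needed to obtain the intended correlation.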

With correlated random effects, we have to distinguish between two situations:

•  The correlation induced by assuming a common random effect $u_t$ for each
cluster (here: year) and

•  the correlation induced by assuming a correlation among the cluster-specific
random effects $\{u_t\}$.

Correlation among observations in the same cluster is a consequence of assuming a
single, cluster-specific random effect shared by all observations in that cluster. For
example, the presence of the cluster-specific random effect $u_t$ in (3.1) leads to a
(marginal) non-negative correlation among the two binomial responses $Y_{1t}$ and $Y_{2t}$
in year $t$. With conditional independence, the marginal covariance between these
two observations is given by

$$\begin{aligned}
\text{cov}(Y_{1t}, Y_{2t}) &= E[\text{cov}(Y_{1t}, Y_{2t} \mid u_t)] + \text{cov}(E[Y_{1t} \mid u_t], E[Y_{2t} \mid u_t]) \\
&= \text{cov}(\text{logit}^{-1}(\tilde{\eta}_{1t} + u_t), \text{logit}^{-1}(\tilde{\eta}_{2t} + u_t)), \qquad (3.2)
\end{aligned}$$

where $\tilde{\eta}_{it}$ is the fixed part of the linear predictor in (3.1). Both functions in (3.2)
are monotone increasing in $u_t$, leading to a non-negative correlation.
Approximations to (3.2) will be dealt with in Section 4.3.

In the example, we attributed the cause of this correlation to the current
(at the time of the interview) public opinion about homosexual relationships,
influencing all respondents in that year. The estimate of $\sigma$ gives an idea about the
magnitude of this correlation, since the more dispersed the $u_t$'s are, the stronger
the correlation among the responses within a year. For instance, if the true $u_t$
for a particular year is positive and far away from zero, as measured by $\sigma$, then
all respondents have a common tendency to give a positive answer. If it is far
away from zero on the negative side, respondents have a common tendency for
a negative answer. This interpretation, of course, is only relative to other fixed
effects included in the linear predictor. For the GSS data, there seems to be
moderate correlation between responses, based on a maximum likelihood estimate
of $\hat{\sigma} = 0.10$ with an approximated asymptotic s.e. of 0.03. This interpretation of
a moderate effect of public opinion on responses within the same year is further
supported by the fact that $\sigma$ can also be interpreted as the regression coefficient for
a standardized version of the random effect $u_t$. A regression coefficient of 0.10 for
a standard normal variable on the logit scale leads to moderate heterogeneity on
the probability scale. This shows that the correlation between responses within a
common year cannot be neglected.
The second consequence of correlated random effects is that observations from
different clusters are correlated, which is a distinctive feature compared to GLMMs
assuming independence between cluster-specific random effects. The conditional log
odds of agreeing with the statement that homosexual relationships are not wrong
at all are now correlated over the years, a feature which is natural for a time series
of binomial observations but would have gone unaccounted for if time-independent
random effects were used. For instance, for the cohort of white respondents ($i = 1$),
the correlation between the conditional log odds at years $t$ and $t^*$ is

$$\text{corr}\left(\text{logit}(\pi_{1t}(u_t)), \text{logit}(\pi_{1t^*}(u_{t^*}))\right) = \rho^{|x_{1t} - x_{1t^*}|}$$

and therefore directly related to the assumed random effects correlation structure.
Marginally, the two binomial responses at the different observation times have
covariance

$$\text{cov}(Y_{1t}, Y_{1t^*}) = \text{cov}(\text{logit}^{-1}(\tilde{\eta}_{1t} + u_t), \text{logit}^{-1}(\tilde{\eta}_{1t^*} + u_{t^*})),$$

which accommodates changing covariance patterns for different observation times
(e.g., decreasing with increasing lag) and also negative covariances (see for instance
the analysis of the Old Faithful geyser eruption data in Chapter 5). We will present
approximations to these marginal correlations in binomial time series in Section
4.3.

Summing up, correlated random effects give us a means of incorporating
correlation between sequential binomial observations that goes beyond independent
or exchangeable correlation structures.

In our example, we attributed the sequential correlation to the gradual change
in public opinion about homosexual relationships over the years, affecting both
races equally. In fact, the maximum likelihood estimate of $\rho$ is equal to 0.65 (s.e.
0.25), indicating that rather strong correlations might exist between responses from
adjacent years.

The model uses 7 parameters (5 fixed effects, 2 variance components) to
describe the 32 probabilities. In comparison with a regular GLM and a GLMM
with independent random time effects, the maximized log-likelihood increases from
-113.0 for the regular GLM to approximately -109.7 for a GLMM with independent
random effects to approximately -107.5 for the GLMM with autoregressive random
effects. Note that the GLM assumes independent observations within and between
the years and that the GLMM with independent random effects $\{u_t\}$ for each year
$t$ assumes correlation of responses within a year, but independence of responses
over the years. Both assumptions might be inappropriate. Our model implies that
the log odds of approval of homosexual relationships are correlated for blacks and
whites within a year (though not very strongly, with an estimate of $\sigma$ equal to 0.1)
and are also correlated for two consecutive years.

The estimates of the fixed parameters and their asymptotic standard errors are
given in Table 5-1. The MCEM algorithm converged after 128 iterations with a
starting Monte Carlo sample size of 50 and a final Monte Carlo sample size of 8600.
Convergence parameters (cf. Section 2.3.3) were $\epsilon_1 = 0.002$, $c = 5$, $\epsilon_2 = 0.003$,
$\epsilon_3 = -0.001$, $a = 1.01$ and $q = 1.2$. Path plots of selected parameter estimates
for two different sets of starting values are shown in Figure 3-2. A detailed
interpretation of the parameters and the effects of the explanatory variables on the
odds of approval is provided in Section 4.3.

Although this example assumed autocorrelated random effects, we will look
at the simpler case of equally correlated random effects next. Then, we discuss
how the correlation parameter $\rho$ can be estimated within the MCEM framework
presented in Section 2.3. Summing up, correlated random effects in GLMMs allow
one to model within-cluster as well as between-cluster correlations for discrete
response variables, where clusters refer to grouping of responses in time.

Figure 3-2: Iteration history for selected parameters and their asymptotic standard
errors for the GSS data.

The iteration number is plotted on the x-axis. The estimates and standard errors
for $\beta_2$ were multiplied by a power of 10 for better plotting. The two different lines in each
plot correspond to two different sets of starting values.
3.2     Equally  Correlated  Random  Effects 

The  introductory  example  modeled  decaying  correlation  between  cross- 
sectional  data  over  time  through  the  use  of  autocorrelated  random  effects.  In  other 
temporal  or  spatial  settings,  the  correlation  might  stay  nearly  constant  between  any 
two  observation  times,  regardless  of  time  or  location  differences  between  the  two 
discrete  responses.  Equally  correlated  random  effects  might  then  be  appropriate  to 
describe  such  a  behavior. 
3.2.1      Definition  of  Equally  Correlated  Random  Effects

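We call random effects equally correlated if

$$\mathrm{var}(u_t) = \sigma^2 \quad \text{for all } t$$

and

$$\mathrm{corr}(u_t, u_{t^*}) = \rho \quad \text{for all } t \neq t^*.$$

More generally, the covariance matrix of the random effects vector $u = (u_1, \ldots, u_T)'$
is given by $\Sigma = \sigma^2 [(1 - \rho) I_T + \rho J_T]$, where $J_T = 1_T 1_T'$. To ensure positive
definiteness, $\rho$ has restricted range, i.e., $1 > \rho > -1/(T - 1)$. The random effects
density is given by

$$g(u; \psi) \propto |\Sigma|^{-1/2} \exp\left\{-\tfrac{1}{2} u' \Sigma^{-1} u\right\}, \qquad (3.3)$$

where now, due to the pattern in $\Sigma$, $|\Sigma| = \sigma^{2T} (1 - \rho)^{T-1} [1 + (T - 1)\rho]$ and

$$\Sigma^{-1} = \frac{1}{\sigma^2 (1 - \rho)} \left[ I_T - \frac{\rho}{1 + (T - 1)\rho} J_T \right].$$

The vector $\psi = (\sigma, \rho)$ holds the variance components of $\Sigma$.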
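
The closed forms for $|\Sigma|$ and $\Sigma^{-1}$ under equal correlation can be checked numerically. The following Python sketch is purely illustrative: the function names are hypothetical and not part of the original development, and plain nested lists stand in for matrices. It builds $\Sigma$, the stated inverse and the stated log-determinant, so that their product can be compared with the identity matrix.

```python
import math


def equicorr_cov(T, sigma, rho):
    """Sigma = sigma^2 [(1 - rho) I_T + rho J_T] as a list of rows."""
    return [[sigma ** 2 * (1.0 if s == t else rho) for s in range(T)]
            for t in range(T)]


def equicorr_precision(T, sigma, rho):
    """Stated closed form of Sigma^{-1}; requires -1/(T-1) < rho < 1."""
    if not -1.0 / (T - 1) < rho < 1.0:
        raise ValueError("rho outside the positive-definite range")
    scale = 1.0 / (sigma ** 2 * (1.0 - rho))
    shrink = rho / (1.0 + (T - 1) * rho)
    return [[scale * ((1.0 if s == t else 0.0) - shrink) for s in range(T)]
            for t in range(T)]


def equicorr_logdet(T, sigma, rho):
    """log|Sigma| = 2T log(sigma) + (T-1) log(1-rho) + log[1+(T-1)rho]."""
    return (2 * T * math.log(sigma) + (T - 1) * math.log(1.0 - rho)
            + math.log(1.0 + (T - 1) * rho))


def matmul(A, B):
    """Plain-Python product of two small square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

Multiplying `equicorr_cov(T, sigma, rho)` by `equicorr_precision(T, sigma, rho)` should return the identity up to rounding for any admissible $\rho$, which is a quick way to catch transcription errors in the two formulas.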

The more complicated random effects structure (as compared to independence
or a single latent random effect) leads to a more complicated M-step in the MCEM
algorithm described in Section 2.3. For a sample $u^{(1)}, \ldots, u^{(m)}$ from the posterior
$h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)})$ evaluated at the previous parameter estimates $\beta^{(k-1)}$ and
$\psi^{(k-1)}$, the function $Q_m^2(\psi)$ introduced in Section 2.3.1 has the form

$$Q_m^2(\psi) = \frac{1}{m} \sum_{j=1}^{m} \log g(u^{(j)}; \psi) \propto -T \log \sigma - \frac{T-1}{2} \log(1 - \rho) - \frac{1}{2} \log[1 + (T - 1)\rho] - \frac{a}{2\sigma^2 (1 - \rho)} + \frac{\rho (1 - \rho)^{-1} \, b}{2\sigma^2 [1 + (T - 1)\rho]},$$

where $a = \frac{1}{m} \sum_{j=1}^{m} u^{(j)\prime} u^{(j)}$ and $b = \frac{1}{m} \sum_{j=1}^{m} u^{(j)\prime} J_T u^{(j)}$ are constants depending on
the sample only.

3.2.2     The  M-step  with  Equally  Correlated  Random  Effects 

The M-step seeks to maximize $Q_m^2$ with respect to $\sigma$ and $\rho$, which is equivalent
to finding their MLEs treating the sample $u^{(1)}, \ldots, u^{(m)}$ as independent. Since this
is not possible in closed form, one way to maximize $Q_m^2$ uses a bivariate Newton-
Raphson algorithm with the Hessian formed by the second-order partial and mixed
derivatives of $Q_m^2$ with respect to $\sigma$ and $\rho$. Some authors (e.g., Lange, 1995; Zhang,
2002) use only a single iteration of the Newton-Raphson algorithm instead of an
entire M-step to speed up convergence. However, this might not always lead to
convergence, since the interval on which the Newton-Raphson algorithm converges
is limited by the restrictions on $\rho$. We now show that, with a little work, the
maximizers for $\sigma$ and $\rho$ can be obtained very quickly.

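For any given value of $\rho$, the ML estimator for $\sigma$ (at iteration $k$) is available in
closed form and is equal to

$$\sigma^{(k)} = \left( \frac{1}{T (1 - \rho)} \left[ a - \frac{\rho}{1 + (T - 1)\rho} \, b \right] \right)^{1/2}.$$

Note that if $\rho = 0$, $\sigma^{(k)} = \left( \frac{1}{mT} \sum_{j=1}^{m} u^{(j)\prime} u^{(j)} \right)^{1/2}$, the estimator for the independence
case presented at the end of Section 2.3.1. Unfortunately, the ML estimator for
$\rho$ has no closed form solution. The first and second partial derivatives of $Q_m^2$ with
respect to $\rho$, obtained by differentiating the expression for $Q_m^2$ term by term, are given by

$$\frac{\partial}{\partial \rho} Q_m^2 = \frac{T (T - 1) \rho}{2 (1 - \rho) [1 + (T - 1)\rho]} - \frac{a - b/T}{2\sigma^2 (1 - \rho)^2} + \frac{(T - 1) \, b}{2 T \sigma^2 [1 + (T - 1)\rho]^2}$$

and

$$\frac{\partial^2}{\partial \rho^2} Q_m^2 = \frac{T (T - 1)}{2} \, \frac{1 + (T - 1)\rho^2}{(1 - \rho)^2 [1 + (T - 1)\rho]^2} - \frac{a - b/T}{\sigma^2 (1 - \rho)^3} - \frac{(T - 1)^2 \, b}{T \sigma^2 [1 + (T - 1)\rho]^3}.$$
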

We obtain the profile likelihood for $\rho$ by plugging the MLE $\sigma^{(k)}$ into the likelihood
equation for $\rho$. Then we use a simple and fast interval-halving (or bisection)
method to find the root for $\rho$. This is advantageous compared to a Newton-
Raphson algorithm since the range of $\rho$ is restricted. Let $f(\rho) = \frac{\partial}{\partial \rho} Q_m^2 \big|_{\sigma = \sigma^{(k)}}$
and let $\rho_1$ and $\rho_2$ be two initial estimates in the appropriate range, satisfying
$\rho_1 < \rho_2$ and $f(\rho_1) f(\rho_2) < 0$. Without loss of generality, assume $f(\rho_1) < 0$.
Clearly, the maximum likelihood estimate $\hat\rho$ must be in the interval $[\rho_1, \rho_2]$. The
interval-halving method computes the midpoint $\rho_3 = (\rho_1 + \rho_2)/2$ of this interval
and updates one of its endpoints in the following way: it sets $\rho_1 = \rho_3$ if $f(\rho_3) < 0$,
or $\rho_2 = \rho_3$ otherwise. The newly formed interval $[\rho_1, \rho_2]$ has half the length of the
initial interval, but still contains $\hat\rho$. Subsequently, a new midpoint $\rho_3$ is calculated,
giving rise to a new interval with one fourth of the length of the initial interval,
but still containing $\hat\rho$. This process is iterated until $|f(\rho_3)| < \epsilon$, where $\epsilon$ is a small
positive constant. To ensure it is a maximum we can check that the value of the
second derivative, $f'(\rho)$, is negative at $\rho_3$. (The second derivative is also needed
for approximating standard errors in the EM algorithm.) The value of $\rho_3$ is then
used as an update for $\rho$ in the maximum likelihood estimator for $\sigma$, and the whole
process of finding the root of $f$ is repeated. Convergence is declared when the
relative change in $\sigma$ and $\rho$ is less than some pre-specified small constant. The
values of $\sigma$ and $\rho$ at this final iteration are the estimates $\sigma^{(k)}$ and $\rho^{(k)}$ from MCEM
iteration $k$.

The issue of how to obtain a sample $u^{(1)}, \ldots, u^{(m)}$ from $h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)})$,
taking into account the special structure of the random effects distribution, will be
discussed in Section 3.4.2.
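
The interval-halving idea for the equally correlated case can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the results in this chapter: the function names are hypothetical, and instead of alternating between the updates for $\sigma$ and $\rho$ the sketch substitutes the closed-form maximizer $\sigma(\rho)$ directly into the score, so that a single bisection on the profile score suffices. The score used in the code is the derivative of $Q_m^2$ with respect to $\rho$ obtained by differentiating the expression for $Q_m^2$ term by term, with $a$ and $b$ the sample constants defined in Section 3.2.1.

```python
import math


def sigma_hat(rho, a, b, T):
    """Closed-form maximizer of Q over sigma for a fixed rho."""
    return math.sqrt((a - rho * b / (1.0 + (T - 1) * rho)) / (T * (1.0 - rho)))


def score_rho(rho, sigma, a, b, T):
    """Partial derivative of Q with respect to rho at (sigma, rho)."""
    lin = 1.0 + (T - 1) * rho
    return (T * (T - 1) * rho / (2.0 * (1.0 - rho) * lin)
            - (a - b / T) / (2.0 * sigma ** 2 * (1.0 - rho) ** 2)
            + (T - 1) * b / (2.0 * T * sigma ** 2 * lin ** 2))


def m_step_equicorr(a, b, T, eps=1e-10, margin=1e-6, max_iter=200):
    """Bisection on the profile score f(rho) = score_rho(rho, sigma_hat(rho))."""
    def f(rho):
        return score_rho(rho, sigma_hat(rho, a, b, T), a, b, T)

    # Start just inside the admissible range -1/(T-1) < rho < 1.
    lo, hi = -1.0 / (T - 1) + margin, 1.0 - margin
    f_lo = f(lo)
    if f_lo * f(hi) >= 0.0:
        raise ValueError("no sign change of the score on the search interval")
    mid = 0.5 * (lo + hi)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if abs(f_mid) < eps:
            break
        if f_mid * f_lo > 0.0:      # same sign as at lo: root is to the right
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return sigma_hat(mid, a, b, T), mid
```

For example, `m_step_equicorr(8.0, 14.0, 4)` takes the sample constants $a = 8$, $b = 14$ with $T = 4$ and returns the pair $(\sigma, \rho)$ at which the score vanishes. Because the search never leaves the admissible interval, the range restriction on $\rho$ is respected automatically, which is the advantage over Newton-Raphson noted above.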

3.3     Autoregressive  Random  Effects 

The use of autoregressive random effects was demonstrated in the introductory
example. Their decaying correlation function makes them a useful
tool for modeling temporal or spatial associations among discrete data. We will
limit ourselves to instances where there is a natural ordering of random effects, and
consider time dependent data first.
3.3.1      Definition  of  Autoregressive  Random  Effects

As with equally correlated random effects in Section 3.2, we can look at the
joint distribution of autoregressive (or autocorrelated) random effects $\{u_t\}_{t=1}^{T}$ as a
mean-zero multivariate normal distribution with patterned covariance matrix $\Sigma$,
defined by the variance and correlation functions

$$\mathrm{var}(u_t) = \sigma^2 \quad \text{for all } t$$

and

$$\mathrm{corr}(u_t, u_{t^*}) = \rho^{|x_t - x_{t^*}|} \quad \text{for all } t \neq t^*,$$

where $x_t$ and $x_{t^*}$ are time points (e.g., years, as in the GSS example) associated
with random effects $u_t$ and $u_{t^*}$. Let $d_t = x_{t+1} - x_t$ denote the time difference
between two successive time points and let $f_t = 1/(1 - \rho^{2 d_t})$, $t = 1, \ldots, T - 1$.
Then, due to the special structure, the determinant of the covariance matrix is
given by $|\Sigma| = \sigma^{2T} \prod_{t=1}^{T-1} f_t^{-1}$ and $\Sigma^{-1}$ is tri-diagonal (Crowder and Hand, 1990,
with correction of a typo in there) with main diagonal

$$\frac{1}{\sigma^2} \left( f_1, \; f_1 + f_2 - 1, \; f_2 + f_3 - 1, \; f_3 + f_4 - 1, \; \ldots, \; f_{T-2} + f_{T-1} - 1, \; f_{T-1} \right)$$

and sub-diagonals

$$-\frac{1}{\sigma^2} \left( f_1 \rho^{d_1}, \; f_2 \rho^{d_2}, \; \ldots, \; f_{T-1} \rho^{d_{T-1}} \right).$$

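For a sample $u^{(1)}, \ldots, u^{(m)}$ from the posterior $h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)})$ evaluated at
the previous parameter estimates $\beta^{(k-1)}$ and $\psi^{(k-1)} = (\sigma^{(k-1)}, \rho^{(k-1)})$, the function
$Q_m^2$ (cf. Section 2.3.1) now has the form

$$Q_m^2(\psi) \propto -T \log \sigma - \frac{1}{2} \sum_{t=1}^{T-1} \log(1 - \rho^{2 d_t}) - \frac{1}{2\sigma^2} \frac{1}{m} \sum_{j=1}^{m} \left[ u_1^{(j)} \right]^2 - \frac{1}{2\sigma^2} \frac{1}{m} \sum_{j=1}^{m} \sum_{t=1}^{T-1} \frac{\left[ u_{t+1}^{(j)} - \rho^{d_t} u_t^{(j)} \right]^2}{1 - \rho^{2 d_t}}, \qquad (3.4)$$

where $u_t^{(j)}$ is the $t$-th component of the $j$-th sampled vector $u^{(j)}$. In the M-step of
an MCEM algorithm, we seek to maximize $Q_m^2$ with respect to $\sigma$ and $\rho$.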

Alternatively, we can view the random effects $\{u_t\}$ as a latent first-order
autoregressive process: random effect $u_{t+1}$ at time $t + 1$ is related to its predecessor
$u_t$ by the equation

$$u_{t+1} = \rho^{d_t} u_t + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma^2 (1 - \rho^{2 d_t})), \quad t = 1, \ldots, T - 1, \qquad (3.5)$$

where $d_t$ again denotes the lag between the two successive time points associated
with random effects $u_t$ and $u_{t+1}$. Assuming a $N(0, \sigma^2)$ distribution for the first
random effect $u_1$, the joint random effects density for $u = (u_1, \ldots, u_T)$ enjoys a
Markov property and has the form

$$g(u; \psi) = g(u_1; \psi) \, g(u_2 \mid u_1; \psi) \cdots g(u_t \mid u_{t-1}; \psi) \cdots g(u_T \mid u_{T-1}; \psi) \qquad (3.6)$$

$$\propto \left( \frac{1}{\sigma} \right)^{T} \left( \prod_{t=1}^{T-1} [1 - \rho^{2 d_t}] \right)^{-1/2} \exp\left\{ -\frac{u_1^2}{2\sigma^2} \right\} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{t=1}^{T-1} \frac{(u_{t+1} - \rho^{d_t} u_t)^2}{1 - \rho^{2 d_t}} \right\},$$

leading, of course, to the same expression for $Q_m^2$ as given in (3.4). For two time
indices $t$ and $t^*$ with $t < t^*$, the random process has autocorrelation function

$$\rho(t, t^*) = \mathrm{corr}(u_t, u_{t^*}) = \rho^{\sum_{k=t}^{t^*-1} d_k}.$$

Before we discuss maximization of $Q_m^2$ in this setting with possibly unequally
spaced observation times, let us comment on the rather unusual parametrization of
the latent random process (3.5). Chan and Ledolter (1995), in their development
of time series models for equally spaced discrete events, use the more common form

$$u_{t+1} = \rho u_t + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma^2), \quad t = 1, \ldots, T - 1.$$

This leads to $\mathrm{var}(u_t) = \sigma^2/(1 - \rho^2)$ for all $t$ if we assume a $N(0, \sigma^2/(1 - \rho^2))$
distribution for $u_1$. (Chan and Ledolter (1995) condition on this first observation,
which leads to closed form solutions for both $\sigma$ and $\rho$ in the case of equidistant
observations.) Since it is common practice to let $\sigma^2$ describe the strength of
association between observations in a common cluster sharing that random effect,
our parametrization seems more natural. In Chan and Ledolter's parametrization,
both the variance and correlation parameter appear in the variance of the random
effect.

In the more general case of unequally spaced observations, the parametrization
$\epsilon_t \sim N(0, \sigma^2)$ results in different variances of the random effects at different time
points (i.e., $\mathrm{var}(u_t) = \sigma^2/(1 - \rho^{2 d_t})$). Considering that the random effects represent
unobservable  phenomena  common  to  all  clusters,  their  variability  should  be  about 
the  same  for  all  clusters,  and  not  depend  on  the  time  difference  between  any  two 
clusters.  There  is  no  reason  to  believe  that  the  strength  of  association  is  larger  in 
some  clusters  and  weaker  in  others.  Therefore,  the  parametrization  we  choose  in 
(3.5)  seems  natural  and  appropriate. 
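
The two descriptions of the autoregressive random effects distribution, the Markov factorization (3.6) built from process (3.5) and the multivariate normal form with tri-diagonal $\Sigma^{-1}$, should give the same density for any vector $u$ and any set of unequally spaced time points. The Python sketch below evaluates the log density both ways so that the two can be compared; it is an illustration only, the function names are hypothetical, and it assumes integer-valued gaps $d_t$ (or $\rho > 0$) so that the powers of $\rho$ are real.

```python
import math


def ar_logdens_markov(u, x, sigma, rho):
    """log g(u) from the factorization g(u_1) * prod g(u_{t+1} | u_t)."""
    ll = -0.5 * math.log(2 * math.pi * sigma ** 2) - u[0] ** 2 / (2 * sigma ** 2)
    for t in range(len(u) - 1):
        d = x[t + 1] - x[t]                       # gap d_t between time points
        v = sigma ** 2 * (1.0 - rho ** (2 * d))   # innovation variance in (3.5)
        resid = u[t + 1] - rho ** d * u[t]
        ll += -0.5 * math.log(2 * math.pi * v) - resid ** 2 / (2 * v)
    return ll


def ar_logdens_precision(u, x, sigma, rho):
    """log g(u) from |Sigma| and the tri-diagonal form of Sigma^{-1}."""
    T = len(u)
    d = [x[t + 1] - x[t] for t in range(T - 1)]
    f = [1.0 / (1.0 - rho ** (2 * dt)) for dt in d]
    # Main diagonal and sub-diagonal of sigma^2 * Sigma^{-1}.
    diag = [f[0]] + [f[t - 1] + f[t] - 1.0 for t in range(1, T - 1)] + [f[-1]]
    off = [-f[t] * rho ** d[t] for t in range(T - 1)]
    quad = (sum(diag[t] * u[t] ** 2 for t in range(T))
            + 2.0 * sum(off[t] * u[t] * u[t + 1] for t in range(T - 1)))
    logdet = 2 * T * math.log(sigma) - sum(math.log(ft) for ft in f)
    return -0.5 * T * math.log(2 * math.pi) - 0.5 * logdet - quad / (2 * sigma ** 2)
```

Calling both functions with, say, four time points spaced 1, 3 and 3 units apart should give identical values up to rounding. Agreement for unequal gaps is what justifies using whichever form is more convenient: the factorization for simulation and Gibbs sampling, the matrix form for writing down $Q_m^2$.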

For spatially correlated data, a relationship between random effects $u_i$ and
$u_{i^*}$ is defined in terms of a distance function $d(x_i, x_{i^*})$ between covariates $x_i$
and $x_{i^*}$ associated with them. Each $u_i$ then represents a random effect for a
spatial cluster, and correlated random effects are again natural to model spatial
dependency among observations in different clusters. In the time setting, we had
$d(x_i, x_{i^*}) = |x_i - x_{i^*}|$ with $x_i$ and $x_{i^*}$ representing time points. In 2-dimensional
spatial settings, $x_i = (x_{i1}, x_{i2})'$ may represent midpoints in a Cartesian system
and $d(x_i, x_{i^*}) = \|x_i - x_{i^*}\|$ is the Euclidean distance function. The so-defined
distance between clusters can be used in a model with correlations between random
effects decaying as distances between cluster midpoints grow, e.g., $\mathrm{corr}(u_i, u_{i^*}) =
\rho^{\|x_i - x_{i^*}\|}$. Models of this form are discussed in Zhang (2002) and in Diggle et al.
(1998) in a Bayesian framework.

Sometimes only the information concerning whether clusters are adjacent to
each other is used to form the correlation structure. In this case, $d(x_i, x_{i^*})$ is a
binary function, indicating whether clusters $i$ and $i^*$ are adjacent or not. Usually, this
leads to an improper joint distribution for the random effects, as for instance in
the analysis of the Scottish lip cancer data set presented in Breslow and Clayton
(1993).
3.3.2     The  M-step  with  Autoregressive  Random  Effects

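Maximizing $Q_m^2$ with respect to $\sigma$ and $\rho$ is again equivalent to finding their
MLEs for the sample $u^{(1)}, \ldots, u^{(m)}$, pretending they are independent. For fixed
$\rho$, maximizing $Q_m^2$ with respect to $\sigma$ is possible in closed form. For notational
convenience, denote the parts depending on $\rho$ and the generated sample $u^{(1)}, \ldots, u^{(m)}$
by

$$a_t(\rho, u) = \frac{1}{m} \sum_{j=1}^{m} \left[ u_{t+1}^{(j)} - \rho^{d_t} u_t^{(j)} \right]^2$$

and

$$b_t(\rho, u) = \frac{1}{m} \sum_{j=1}^{m} \left[ u_{t+1}^{(j)} - \rho^{d_t} u_t^{(j)} \right] u_t^{(j)},$$

with derivatives with respect to $\rho$ (indicated by a prime) given by

$$a_t'(\rho, u) = -2 d_t \rho^{d_t - 1} \, b_t(\rho, u)$$

and

$$b_t'(\rho, u) = -d_t \rho^{d_t - 1} \, \frac{1}{m} \sum_{j=1}^{m} \left[ u_t^{(j)} \right]^2.$$

The maximum likelihood estimator of $\sigma$ at iteration $k$ of the MCEM algorithm has
the form

$$\sigma^{(k)} = \left( \frac{1}{T} \left[ \frac{1}{m} \sum_{j=1}^{m} \left[ u_1^{(j)} \right]^2 + \sum_{t=1}^{T-1} \frac{a_t(\rho, u)}{1 - \rho^{2 d_t}} \right] \right)^{1/2}.$$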

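For the special case of independent random effects ($\rho = 0$), this simplifies to the
estimator $\left( \frac{1}{mT} \sum_{j=1}^{m} u^{(j)\prime} u^{(j)} \right)^{1/2}$ presented at the end of Section 2.3.1. (The equal
correlation structure cannot be presented as a special case of the autocorrelation
structure.) No closed form solution exists for $\rho^{(k)}$. Let

$$c_t(\rho) = \frac{d_t \rho^{d_t - 1}}{1 - \rho^{2 d_t}}$$

and

$$e_t(\rho) = \frac{\rho}{d_t} [c_t(\rho)]^2$$

be terms depending on $\rho$ but not on $u$, with derivatives given by

$$c_t'(\rho) = \frac{d_t - 1}{\rho} c_t(\rho) + 2 \rho^{d_t} [c_t(\rho)]^2$$

and

$$e_t'(\rho) = \frac{2\rho}{d_t} c_t(\rho) c_t'(\rho) + \frac{1}{d_t} [c_t(\rho)]^2,$$

respectively. Then, the first and second partial derivatives of $Q_m^2$ with respect to $\rho$
can be written as

$$\frac{\partial}{\partial \rho} Q_m^2 = \sum_{t=1}^{T-1} \rho^{d_t} c_t(\rho) + \frac{1}{\sigma^2} \sum_{t=1}^{T-1} \left[ c_t(\rho) b_t(\rho, u) - e_t(\rho) a_t(\rho, u) \right]$$

and

$$\frac{\partial^2}{\partial \rho^2} Q_m^2 = \sum_{t=1}^{T-1} \left[ d_t \rho^{d_t - 1} c_t(\rho) + \rho^{d_t} c_t'(\rho) \right] + \frac{1}{\sigma^2} \sum_{t=1}^{T-1} \left[ c_t'(\rho) b_t(\rho, u) + c_t(\rho) b_t'(\rho, u) - e_t'(\rho) a_t(\rho, u) - e_t(\rho) a_t'(\rho, u) \right].$$

A Newton-Raphson algorithm with Hessian formed of partial and mixed derivatives
of $Q_m^2$ with respect to $\sigma$ and $\rho$ can be employed to find the maximum likelihood
estimators at iteration $k$. However, since the range of $\rho$ is restricted, it might be
advantageous to use the interval-halving method on $\frac{\partial}{\partial \rho} Q_m^2 \big|_{\sigma = \sigma^{(k)}}$ described in the
previous section.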
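
As an illustration of how the M-step for autoregressive random effects with unequal gaps could be organized, the following Python sketch computes the sample moments, the closed-form $\sigma$ for fixed $\rho$, and the score for $\rho$, and then applies interval halving to the profile score. Several points are assumptions of the sketch rather than statements about the original implementation: the function names are hypothetical, the gaps $d_t$ are taken to be positive integers, $\sigma(\rho)$ is substituted inside the score so that one bisection suffices, and the score is written with $c_t(\rho) = d_t \rho^{d_t - 1}/(1 - \rho^{2 d_t})$ and $e_t(\rho) = (\rho/d_t)[c_t(\rho)]^2$, which is what differentiating (3.4) gives.

```python
import math


def ar_stats(samples):
    """Second moments of the sampled vectors, averaged over j = 1..m."""
    m, T = len(samples), len(samples[0])
    first = sum(u[0] ** 2 for u in samples) / m
    nxt = [sum(u[t + 1] ** 2 for u in samples) / m for t in range(T - 1)]
    crs = [sum(u[t + 1] * u[t] for u in samples) / m for t in range(T - 1)]
    prv = [sum(u[t] ** 2 for u in samples) / m for t in range(T - 1)]
    return first, nxt, crs, prv


def a_b(rho, d, stats):
    """a_t(rho, u) and b_t(rho, u), expanded in terms of the moments."""
    _, nxt, crs, prv = stats
    a = [nxt[t] - 2 * rho ** dt * crs[t] + rho ** (2 * dt) * prv[t]
         for t, dt in enumerate(d)]
    b = [crs[t] - rho ** dt * prv[t] for t, dt in enumerate(d)]
    return a, b


def sigma_hat(rho, d, stats):
    """Closed-form maximizer over sigma for fixed rho."""
    a, _ = a_b(rho, d, stats)
    total = stats[0] + sum(a[t] / (1.0 - rho ** (2 * dt)) for t, dt in enumerate(d))
    return math.sqrt(total / (len(d) + 1))


def score_rho(rho, sigma, d, stats):
    """Partial derivative of Q with respect to rho."""
    a, b = a_b(rho, d, stats)
    total = 0.0
    for t, dt in enumerate(d):
        c = dt * rho ** (dt - 1) / (1.0 - rho ** (2 * dt))
        e = rho / dt * c ** 2
        total += rho ** dt * c + (c * b[t] - e * a[t]) / sigma ** 2
    return total


def m_step_ar(samples, d, lo=-0.999, hi=0.999, eps=1e-10, max_iter=200):
    """Interval halving on the profile score; returns (sigma, rho)."""
    stats = ar_stats(samples)

    def f(rho):
        return score_rho(rho, sigma_hat(rho, d, stats), d, stats)

    f_lo = f(lo)
    if f_lo * f(hi) >= 0.0:
        raise ValueError("score does not change sign on [lo, hi]")
    mid = 0.5 * (lo + hi)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if abs(f_mid) < eps:
            break
        if f_mid * f_lo > 0.0:      # same sign as at lo: root is to the right
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return sigma_hat(mid, d, stats), mid
```

Here `samples` is a list of $m$ sampled vectors of length $T$ and `d` the list of $T - 1$ gaps. Since the sample enters only through four sets of averaged moments, these can be computed once per MCEM iteration and reused for every evaluation of the score.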

For the special case of equidistant time points $\{x_t\}$, the distances $d_t$ are equal
for all $t = 1, \ldots, T - 1$. Without loss of generality, we assume $d_t = 1$ for all $t$. Then the
random effects follow the simple first-order autoregressive process $u_{t+1} = \rho u_t + \epsilon_t$,
$t = 1, \ldots, T - 1$, where we assume that $u_1 \sim N(0, \sigma^2)$ and $\epsilon_t$ i.i.d. $N(0, \sigma^2 (1 - \rho^2))$.
Certain simplifications occur. Let

$$a = \frac{1}{m} \sum_{j=1}^{m} \left[ u_1^{(j)} \right]^2, \quad b = \frac{1}{m} \sum_{j=1}^{m} \sum_{t=1}^{T-1} \left[ u_{t+1}^{(j)} \right]^2, \quad c = \frac{1}{m} \sum_{j=1}^{m} \sum_{t=1}^{T-1} u_{t+1}^{(j)} u_t^{(j)}, \quad d = \frac{1}{m} \sum_{j=1}^{m} \sum_{t=1}^{T-1} \left[ u_t^{(j)} \right]^2$$

denote constants depending on the generated samples only, but not on any parameters.
Then the maximum likelihood estimator of $\sigma$ at iteration $k$ is

$$\sigma^{(k)} = \left( \frac{1}{T} \left[ a + \frac{1}{1 - \rho^2} \left( b - 2\rho c + \rho^2 d \right) \right] \right)^{1/2}$$

and, upon plugging it into the score equation $\frac{\partial}{\partial \rho} Q_m^2 \big|_{\sigma = \sigma^{(k)}} = 0$ for $\rho$, we obtain as
the new score equation

$$\rho^3 \left[ \frac{T - 1}{T} (d - a) \right] + \rho^2 \left[ \frac{2 - T}{T} c \right] + \rho \left[ \frac{T - 1}{T} a - \frac{1}{T} b - d \right] + c = 0, \qquad (3.7)$$

a polynomial of order three. A result by Witt (1987) mentioned in McKeown and
Johnson (1996) shows that (3.7) has three real solutions, only one of which lies
in the interval $(-1, 1)$. This must be the maximum likelihood estimator $\rho^{(k)}$ at
iteration $k$ for the case of equidistant time points. Exact solutions to this third
degree polynomial are for instance given in Abramowitz and Stegun (1964) and we
only need to iterate between the two explicit solutions till convergence.

In most of our applications the number of distinct observation times $T$
is rather large, and generating independent $T$-dimensional vectors $u$ from the
posterior $h(u \mid y)$, evaluated at the parameter estimates from iteration $k - 1$, as
required to approximate the E-step is difficult, even
with the nice (prior) autoregressive relationship among the components of $u$. The
next section discusses this issue.

There  are  many  other  correlation  structures  which  are  not  discussed  here.  For 
instance,  the  first  order  autoregressive  random  process  can  be  extended  to  a  p-order 
random  process  and  the  formulas  provided  here  and  in  the  next  section  can  be 
modified  accordingly. 
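
For the equidistant case, the root of the cubic score equation (3.7) inside $(-1, 1)$ can also be located numerically. The Python sketch below is a hedged illustration with hypothetical function names. It takes the sample constants as $a = \frac{1}{m}\sum_j [u_1^{(j)}]^2$, $b = \frac{1}{m}\sum_j \sum_t [u_{t+1}^{(j)}]^2$, $c = \frac{1}{m}\sum_j \sum_t u_{t+1}^{(j)} u_t^{(j)}$ and $d = \frac{1}{m}\sum_j \sum_t [u_t^{(j)}]^2$, and it swaps the explicit cubic formulas for simple bisection, relying on the cited result that exactly one root lies in $(-1, 1)$. With these constants the polynomial equals $(b + d + 2c)/T \geq 0$ at $\rho = -1$ and $-(b + d - 2c)/T \leq 0$ at $\rho = 1$, so the interval always brackets the root.

```python
import math


def cubic_coefficients(a, b, c, d, T):
    """Coefficients (k3, k2, k1, k0) of the score polynomial (3.7)."""
    return ((T - 1) / T * (d - a),
            (2 - T) / T * c,
            (T - 1) / T * a - b / T - d,
            c)


def solve_rho_equidistant(a, b, c, d, T, tol=1e-14):
    """Root of (3.7) in (-1, 1), found by bisection instead of cubic formulas."""
    k3, k2, k1, k0 = cubic_coefficients(a, b, c, d, T)

    def p(r):
        return ((k3 * r + k2) * r + k1) * r + k0   # Horner evaluation

    lo, hi = -1.0, 1.0          # p(-1) >= 0 >= p(1) for valid sample moments
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if p(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def sigma_hat_equidistant(rho, a, b, c, d, T):
    """Closed-form sigma for given rho in the equidistant case."""
    return math.sqrt((a + (b - 2 * rho * c + rho ** 2 * d) / (1 - rho ** 2)) / T)
```

As a plausibility check, feeding in the population moments of a process with $\sigma = 2$, $\rho = 0.8$ and $T = 10$ (that is, $a = 4$, $b = d = 36$, $c = 28.8$) should return exactly those parameter values, because (3.7) then factors as $(T - 1)\sigma^2 (\rho - 0.8)\left[\frac{T-2}{T}\rho^2 - 1\right]$.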

3.4     Sampling  from  the  Posterior  Distribution  Via  Gibbs  Sampling 

In Section 2.3.2 we gave a general description of how to obtain a random sample
$u^{(1)}, \ldots, u^{(m)}$ from $h(u \mid y)$. (As in Section 2.3.2, we suppress the dependency
on the parameter estimates from the previous iteration.) For high dimensional
random effects distributions $g(u)$, generating independent draws from $h(u \mid y)$
can get very time consuming, if not impossible. The Gibbs sampler introduced in
Section 2.3.2 offers an alternative because it involves sampling from lower dimensional
(often univariate) conditional distributions of $h(u \mid y)$, which is considerably
faster. However, it results in dependent samples from the posterior random effects
distribution. The distributional structure of equally correlated random effects
or autoregressive random effects is very amenable to Gibbs sampling because of
the simplifications that occur in the full univariate conditionals. Remember that
the two-stage hierarchy and the conditional independence assumption in GLMMs
imply that

$$h(u \mid y) \propto f(y \mid u) \, g(u) = \left[ \prod_{t=1}^{T} f(y_t \mid u_t) \right] g(u),$$

the product of the conditional densities of observations sharing a common random
effect, multiplied by the random effects density. In the following, let
$u = (u_1, \ldots, u_T)$. We discuss the case of autoregressive random effects first.
3.4.1      A  Gibbs  Sampler  for  Autoregressive  Random  Effects

From representation (3.6) of the random effects distribution, we see that the
full univariate conditional distribution of $u_t$ given the other $T - 1$ components of $u$
only depends on its neighbors $u_{t-1}$ and $u_{t+1}$, i.e.,

$$g(u_t \mid u_1, \ldots, u_{t-1}, u_{t+1}, \ldots, u_T) \propto g(u_t \mid u_{t-1}) \, g(u_{t+1} \mid u_t), \quad t = 2, \ldots, T - 1.$$

At the beginning ($t = 1$) and the end ($t = T$) of the process, the conditional
distributions of $u_1$ and $u_T$ only depend on the successor $u_2$ and predecessor $u_{T-1}$,
respectively. Furthermore, random effect $u_t$ only applies to observations $y_t =
(y_{t1}, \ldots, y_{t n_t})$ at a common time point that share that random effect, but not to
other observations. Hence, the full univariate conditionals of the posterior random
effects distribution can be expressed as

$$h_1(u_1 \mid u_2, y_1) \propto f(y_1 \mid u_1) \, g_1(u_1 \mid u_2)$$
$$h_t(u_t \mid u_{t-1}, u_{t+1}, y_t) \propto f(y_t \mid u_t) \, g_t(u_t \mid u_{t-1}, u_{t+1}), \quad t = 2, \ldots, T - 1$$
$$h_T(u_T \mid u_{T-1}, y_T) \propto f(y_T \mid u_T) \, g_T(u_T \mid u_{T-1}),$$

where, using standard multivariate normal theory results,

$$g_1(u_1 \mid u_2) = N\left( \rho^{d_1} u_2, \; \sigma^2 [1 - \rho^{2 d_1}] \right)$$
$$g_t(u_t \mid u_{t-1}, u_{t+1}) = N\left( \frac{\rho^{d_{t-1}} [1 - \rho^{2 d_t}] u_{t-1} + \rho^{d_t} [1 - \rho^{2 d_{t-1}}] u_{t+1}}{1 - \rho^{2(d_{t-1} + d_t)}}, \; \sigma^2 \frac{[1 - \rho^{2 d_{t-1}}][1 - \rho^{2 d_t}]}{1 - \rho^{2(d_{t-1} + d_t)}} \right), \quad t = 2, \ldots, T - 1$$
$$g_T(u_T \mid u_{T-1}) = N\left( \rho^{d_{T-1}} u_{T-1}, \; \sigma^2 [1 - \rho^{2 d_{T-1}}] \right).$$

For equally spaced data ($d_t = 1$ for all $t$) these distributions reduce to the ones
derived in Chan and Ledolter (1995).

Direct sampling from the full univariate conditionals $h_t$ is not possible.
However, it is straightforward to implement an accept-reject algorithm. In fact,
the accept-reject algorithm as outlined in Section 2.3.2 applies directly with target
density $h_t$ and candidate density $g_t$, since $h_t$ has the form of an exponential family
density multiplied by a normal density. In Section 2.3.2, we discussed the accept-
reject algorithm for generating an entire vector $u$ from the posterior random effects
distribution $h(u \mid y)$ with candidate density $g(u)$ and mentioned that acceptance
probabilities are virtually zero for large dimensional $u$'s. With the Gibbs sampler,
we have reduced the problem to univariate sampling of the $t$-th component $u_t$ from
the univariate target density $h_t$ with univariate candidate density $g_t$. By selecting
$M_t = L(y_t)$, where $L(y_t)$ is the saturated likelihood for observations at time point
$t$, we ensure that the target density satisfies $h_t \leq M_t g_t$.

Given $u^{(j-1)} = (u_1^{(j-1)}, \ldots, u_T^{(j-1)})$ from the previous iteration, the Gibbs
sampler with accept-reject sampling from the full univariate conditionals consists of

1.  generate first component $u_1^{(j)} \sim h_1(u_1 \mid u_2^{(j-1)}, y_1)$ by

    (a)  generation step:
         generate $u_1$ from candidate density $g_1(u_1 \mid u_2^{(j-1)})$;
         generate $U \sim \mathrm{Uniform}[0, 1]$;

    (b)  acceptance step:
         set $u_1^{(j)} = u_1$ if $U \leq f(y_1 \mid u_1)/L(y_1)$; return to (a) otherwise;

2.  for $t = 2, \ldots, T - 1$:
    generate component $u_t^{(j)} \sim h_t(u_t \mid u_{t-1}^{(j)}, u_{t+1}^{(j-1)}, y_t)$ by

    (a)  generation step:
         generate $u_t$ from candidate density $g_t(u_t \mid u_{t-1}^{(j)}, u_{t+1}^{(j-1)})$;
         generate $U \sim \mathrm{Uniform}[0, 1]$;

    (b)  acceptance step:
         set $u_t^{(j)} = u_t$ if $U \leq f(y_t \mid u_t)/L(y_t)$; return to (a) otherwise;

3.  generate last component $u_T^{(j)} \sim h_T(u_T \mid u_{T-1}^{(j)}, y_T)$ by

    (a)  generation step:
         generate $u_T$ from candidate density $g_T(u_T \mid u_{T-1}^{(j)})$;
         generate $U \sim \mathrm{Uniform}[0, 1]$;

    (b)  acceptance step:
         set $u_T^{(j)} = u_T$ if $U \leq f(y_T \mid u_T)/L(y_T)$; return to (a) otherwise;

4.  set $u^{(j)} = (u_1^{(j)}, \ldots, u_T^{(j)})$.

The so-obtained sample $u^{(1)}, \ldots, u^{(m)}$ (after allowing for burn-in) forms a
dependent sample which we use to approximate the E-step in the $k$th iteration of
the MCEM algorithm. Note that all densities are evaluated at current parameter
estimates, i.e., $\beta^{(k-1)}$ for $f(y_t \mid u_t)$ and $\psi^{(k-1)} = (\sigma^{(k-1)}, \rho^{(k-1)})$ for
$g_t(u_t \mid u_{t-1}, u_{t+1})$.
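
The steps above can be sketched in Python for the simplest setting of one binary response per time point with a logit link, as in the simulation study of Section 3.5. This is an illustration under stated assumptions rather than the code behind the reported results: all function and argument names are hypothetical, `offset[t]` stands for the fixed part of the linear predictor at the current parameter estimates, the gaps `d` are integers, and the candidate mean and variance are the conditional normal moments implied by process (3.5). For a single Bernoulli response the saturated likelihood is $L(y_t) = 1$, so a candidate is accepted whenever $U \leq f(y_t \mid u_t)$.

```python
import math
import random


def conditional_prior(t, u, d, sigma, rho):
    """Mean and standard deviation of g_t(u_t | neighbours) for the AR prior."""
    T = len(u)
    if t == 0:                                    # first component: successor only
        r = rho ** d[0]
        return r * u[1], sigma * math.sqrt(1.0 - r * r)
    if t == T - 1:                                # last component: predecessor only
        r = rho ** d[T - 2]
        return r * u[T - 2], sigma * math.sqrt(1.0 - r * r)
    A, B = rho ** (2 * d[t - 1]), rho ** (2 * d[t])
    mean = (rho ** d[t - 1] * (1.0 - B) * u[t - 1]
            + rho ** d[t] * (1.0 - A) * u[t + 1]) / (1.0 - A * B)
    var = sigma ** 2 * (1.0 - A) * (1.0 - B) / (1.0 - A * B)
    return mean, math.sqrt(var)


def bernoulli_density(y, eta):
    """f(y | u) for one binary response with log odds eta."""
    p = 1.0 / (1.0 + math.exp(-eta))
    return p if y == 1 else 1.0 - p


def gibbs_sweep(u, y, offset, d, sigma, rho, rng):
    """One pass over t = 1, ..., T; returns the new vector u^(j)."""
    u = list(u)       # components before t are already updated, later ones are old
    for t in range(len(u)):
        mean, sd = conditional_prior(t, u, d, sigma, rho)
        while True:
            candidate = rng.gauss(mean, sd)               # generation step
            # acceptance step; L(y_t) = 1 for a single Bernoulli response
            if rng.random() <= bernoulli_density(y[t], offset[t] + candidate):
                u[t] = candidate
                break
    return u
```

Repeated calls such as `u = gibbs_sweep(u, y, offset, d, sigma, rho, random.Random(0))` (reusing one generator across calls) produce the dependent sequence $u^{(1)}, u^{(2)}, \ldots$; early sweeps would be discarded as burn-in. Because each candidate comes from a univariate normal centered between its neighbors, acceptance rates stay workable even when $T$ is in the hundreds.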
3.4.2     A  Gibbs  Sampler  for  Equally  Correlated  Random  Effects 

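Similar results as for the autoregressive correlation structure can be derived
for the case of equally correlated random effects. In this case, the full univariate
conditional of $u_t$ depends on all other $T - 1$ components of $u$, as can be seen from
(3.3). Let $u_{t-}$ denote the vector $u$ with the $t$-th component deleted. Using similar
notation as in the previous section, the full univariate conditionals of $h(u \mid y)$ are
given by

$$h_t(u_t \mid u_{t-}, y_t) \propto f(y_t \mid u_t) \, g_t(u_t \mid u_{t-}), \quad t = 1, \ldots, T,$$

where, with standard results from multivariate normal theory, $g_t(u_t \mid u_{t-})$ is a
$N(\mu_t, \tau_t^2)$ density with

$$\mu_t = \left[ \frac{\rho}{1 - \rho} - \frac{(T - 1) \rho^2 (1 - \rho)^{-1}}{1 + (T - 2)\rho} \right] \sum_{s \neq t} u_s = \frac{\rho}{1 + (T - 2)\rho} \sum_{s \neq t} u_s$$

and

$$\tau_t^2 = \sigma^2 \left( 1 - \frac{(T - 1) \rho^2}{1 - \rho} + \frac{(T - 1)^2 \rho^3 (1 - \rho)^{-1}}{1 + (T - 2)\rho} \right) = \sigma^2 \, \frac{(1 - \rho)[1 + (T - 1)\rho]}{1 + (T - 2)\rho}.$$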
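
The conditional mean and variance of $u_t$ given all other components under equal correlation can be written compactly as $\mu_t = \frac{\rho}{1 + (T-2)\rho} \sum_{s \neq t} u_s$ and $\tau_t^2 = \sigma^2 (1 - \rho)[1 + (T-1)\rho]/[1 + (T-2)\rho]$. The short Python sketch below, with hypothetical function names, evaluates both this compact form and the longer form obtained directly from partitioning $\Sigma$, so the two can be compared numerically.

```python
def equicorr_conditional(t, u, sigma, rho):
    """Mean and variance of g_t(u_t | all other components), compact form."""
    T = len(u)
    others = sum(u) - u[t]                  # sum of the remaining T - 1 effects
    denom = 1.0 + (T - 2) * rho
    mean = rho / denom * others
    var = sigma ** 2 * (1.0 - rho) * (1.0 + (T - 1) * rho) / denom
    return mean, var


def equicorr_conditional_long(t, u, sigma, rho):
    """Same moments in the unsimplified form from the partitioned Sigma."""
    T = len(u)
    others = sum(u) - u[t]
    denom = 1.0 + (T - 2) * rho
    weight = rho / (1.0 - rho) - (T - 1) * rho ** 2 / ((1.0 - rho) * denom)
    var = sigma ** 2 * (1.0 - (T - 1) * rho ** 2 / (1.0 - rho)
                        + (T - 1) ** 2 * rho ** 3 / ((1.0 - rho) * denom))
    return weight * others, var
```

Note that the conditional variance is the same for every $t$ and the mean depends on the other components only through their sum, so a Gibbs sweep can keep a running total of $u$ instead of recomputing the sum for each component.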

Given the vector $u^{(j-1)}$ from the previous iteration, the Gibbs sampler with
accept-reject sampling from the full univariate conditionals has the form

1.  for $t = 1, \ldots, T$,
    generate component $u_t^{(j)} \sim h_t(u_t \mid u_1^{(j)}, \ldots, u_{t-1}^{(j)}, u_{t+1}^{(j-1)}, \ldots, u_T^{(j-1)}, y_t)$ by

    (a)  generation step:
         generate $u_t$ from candidate density $g_t(u_t \mid u_1^{(j)}, \ldots, u_{t-1}^{(j)}, u_{t+1}^{(j-1)}, \ldots, u_T^{(j-1)})$;
         generate $U \sim \mathrm{Uniform}[0, 1]$;

    (b)  acceptance step:
         set $u_t^{(j)} = u_t$ if $U \leq f(y_t \mid u_t)/L(y_t)$; return to (a) otherwise;

2.  set $u^{(j)} = (u_1^{(j)}, \ldots, u_T^{(j)})$.

This leads to a sample $u^{(1)}, \ldots, u^{(m)}$ from the posterior distribution used
in the E- and M-step of the MCEM algorithm at iteration $k$. Note again that
all distributions are evaluated at their current parameter estimates $\beta^{(k-1)}$ and
$\psi^{(k-1)}$.

3.5     A  Simulation  Study 

We conducted a simulation study to evaluate the performance of the maximum
likelihood estimation algorithm, to evaluate the bias in the estimation of covariate
effects and variance components and to compare predicted random effects to the
ones used in the simulation of the data. To this end, we generated a time series
$y_1, \ldots, y_T$ of $T = 400$ binary observations according to the model

$$\mathrm{logit}(\pi_t(u_t)) = \alpha + \beta x_t + u_t \qquad (3.8)$$

for the conditional log odds of success at time $t$, $t = 1, \ldots, 400$. For the simulation,
we chose $\alpha = 1$ and $\beta = 1$, where $\beta$ is the regression coefficient for independent
standard normal distributed covariates $x_t$ i.i.d. $N(0, 1)$. The random effects
$u_1, \ldots, u_T$ are thought to arise from an unobserved latent autoregressive
process $u_{t+1} = \rho u_t + \epsilon_t$, where the $\epsilon_t$ are i.i.d. normal with mean 0 and standard
deviation $\sigma \sqrt{1 - \rho^2}$, i.e., the $u_t$'s have standard
deviation $\sigma$ and lag $t$ correlation $\rho^t$. For the simulation of these autoregressive
random effects, we used $\sigma = 2$ and $\rho = 0.8$. The resulting sample autocorrelation
function of the realized random effects is pictured in Figure 3-4. The standard
deviation and lag 1 correlation of the 400 realized values of $u_1, \ldots, u_T$ are equal to
1.95 and 0.77, respectively. Note that conditional on the realized values of the $u_t$'s, the $y_t$'s are
generated independently with log odds given by (3.8). The MCEM algorithm as
described in Sections 2.3 and 3.3 for a logistic GLMM with autocorrelated random
effects yielded the following maximum likelihood estimates for the fixed effects and
variance components: $\hat\alpha = 0.94$ (0.39), $\hat\beta = 1.03$ (0.22), as compared to the true
values 1 and 1, and $\hat\sigma = 2.25$ (0.44) and $\hat\rho = 0.74$ (0.06), as compared to the realized
values 1.95 and 0.77.
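
The data-generating mechanism of this simulation can be reproduced in outline with a few lines of Python. The sketch below is illustrative only: the function names, the seeding and the use of the standard library generator are assumptions, and the first random effect is drawn from $N(0, \sigma^2)$ as in Section 3.3, so a run will not reproduce the particular realized values 1.95 and 0.77 quoted above.

```python
import math
import random


def simulate_series(T=400, alpha=1.0, beta=1.0, sigma=2.0, rho=0.8, seed=0):
    """Covariates x, latent AR(1) random effects u and binary responses y."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(T)]          # x_t i.i.d. N(0, 1)
    u = [rng.gauss(0.0, sigma)]                          # u_1 ~ N(0, sigma^2)
    innov_sd = sigma * math.sqrt(1.0 - rho ** 2)         # sd of the innovations
    for _ in range(T - 1):
        u.append(rho * u[-1] + rng.gauss(0.0, innov_sd))
    y = []
    for t in range(T):
        p = 1.0 / (1.0 + math.exp(-(alpha + beta * x[t] + u[t])))   # model (3.8)
        y.append(1 if rng.random() < p else 0)
    return x, u, y


def sample_sd(v):
    """Sample standard deviation with divisor n - 1."""
    mean = sum(v) / len(v)
    return math.sqrt(sum((vi - mean) ** 2 for vi in v) / (len(v) - 1))


def lag1_corr(v):
    """Lag 1 sample autocorrelation."""
    mean = sum(v) / len(v)
    num = sum((v[t] - mean) * (v[t + 1] - mean) for t in range(len(v) - 1))
    return num / sum((vi - mean) ** 2 for vi in v)
```

Applying `sample_sd` and `lag1_corr` to the simulated `u` gives the realized counterparts of $\sigma$ and $\rho$ for that run. With only 400 strongly autocorrelated values these can differ noticeably from 2 and 0.8, which is why the estimates above are compared with the realized rather than the nominal values.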

The algorithm converged after 71 iterations with a starting Monte Carlo
sample size of 50 and a final Monte Carlo sample size of only 880, although
estimated standard errors are based on a Monte Carlo sample size of 20,000.
Convergence parameters were set to $\epsilon_1 = 0.003$, $c = 3$, $\epsilon_2 = 0.005$, $\epsilon_3 = -0.001$,
$a = 1.03$ and $q = 1.05$ (see Section 2.3.3). Regular GLM estimates were used as
starting values for $\alpha$ and $\beta$ and starting values for $\sigma$ and $\rho$ were set to 1.5 and 0,
respectively.

As will be described in Section 5.4.2, we estimated random effects through a
Monte Carlo approximation of their posterior mean: $\hat{u} = E[u \mid y]$. The scatter
plot in Figure 3-3 shows good agreement between the realized random
effects $u_1, \ldots, u_T$ from the simulation and the estimated random effects $\hat{u}_1, \ldots, \hat{u}_T$
from the model. Note though that the standard deviation of the estimated random
effects is equal to 1.60 (as compared to the true standard deviation of 1.95),
showing that estimated random effects are less variable and illustrating the general
shrinkage effect (compare the scales on the x and y axis of Figure 3-3) brought along by
using posterior mean estimates. Also, a comparison of the autocorrelation and
partial autocorrelation functions of the realized and estimated random effects in
Figure 3-4 reveals some differences due to the fact that estimated random effects
are based on the posterior distribution of $u \mid y$. Therefore, estimated random
effects are only of limited use in checking assumptions on the true random effects.
Only when their behavior is grossly unexpected compared to the assumed structure
of the underlying latent random process may they serve as an indication of model
inappropriateness. Related remarks are given by Verbeke and Molenberghs (2000),
who generate data in a linear mixed model assuming a mixture of two normal
distributions for the random effects, resulting in a bimodal distribution. There, too,
the plot of posterior mean estimates of the random effects from a model that misspecified
the random effects distribution does not reveal that anything went wrong.

Figure 3-3: Realized (simulated) random effects $u_1, \ldots, u_T$ versus estimated
random effects $\hat{u}_1, \ldots, \hat{u}_T$.

Figure 3-4: Comparing simulated and estimated random effects.
Sample autocorrelation (first row) and partial autocorrelation (second row) functions
for realized (simulated) random effects $u_1, \ldots, u_T$ (first column) and estimated
random effects $\hat{u}_1, \ldots, \hat{u}_T$ (second column).

We repeated the above simulation 100 times, using the same specifications,
starting values and convergence criteria as mentioned above. Each of the 100
generated binary time series of length 400 was fit using the MCEM algorithm.
Table 3-1 shows the average (over the 100 generated time series) of the fixed
parameter and variance component estimates and their average estimated standard
errors. On average, the GLMM estimates of the fixed effects $\alpha$ and $\beta$ and the
variance components are very close to the true parameters, although the true lag
1 correlation of the random effects is underestimated by 6.3%. Table 3-1 also
displays, in parentheses, the standard deviations of all estimated parameters in the
100 replications. Comparing these to the theoretical estimates of the asymptotic
standard errors, we see good agreement. This suggests that the procedure for
finding standard errors we described and implemented (via Louis's (1982) formula)
in our MCEM algorithm works well. In 5 (5%) out of the 100 simulations, the
approximation of the asymptotic covariance matrix by Monte Carlo methods
resulted in a negative definite matrix. For these simulations, a larger Monte Carlo
sample after convergence of the MCEM algorithm (the default was 20,000) might
be necessary.

It is also interesting to note that of the 95 simulations with positive definite
covariance matrix, 6 (6.3%) resulted in a non-significant (based on a 5% level
Wald test) estimate of the regression coefficient $\beta$ under the GLMM with
autoregressive random effects, while none was declared non-significant with the
GLM approach. Estimates and standard errors for a corresponding GLM
fit are also provided in Table 3-1. The average Monte Carlo sample size at the final
iteration of the MCEM algorithm was 1200, although highly dispersed, ranging from
210 to 21,000. The average computation time (on a mobile Pentium III, 600 MHz
processor with 256MB RAM) to convergence, including estimating the covariance
matrix, was 73 minutes.

We ran two other simulation studies, now with a shorter length of only
T = 100 observations, and a true lag 1 correlation of 0.6 and -0.8, respectively.
All other parameters remained unchanged. These results are also summarized in
Table 3-1. Again, we observe that the estimated parameters are very close to the
true ones, but on average the correlation was underestimated by 15% and 6.3%,
respectively. However, the sampling errors of the correlation parameters (shown in
parentheses in Table 3-1) were large enough to include the true values.

Table 3-1: A simulation study for a logistic GLMM with autoregressive random
effects.

            α         β         σ         ρ         s.e.(α)   s.e.(β)   s.e.(σ)   s.e.(ρ)

True:       1         1         2         0.8       T = 400
GLM:        0.64      0.63                          0.11      0.12
            (0.20)    (0.14)                        (0.01)    (0.01)
GLMM:       1.07      1.02      2.08      0.75      0.37      0.26      0.61      0.10
            (0.32)    (0.20)    (0.25)    (0.06)    (0.12)    (0.23)    (0.40)    (0.05)

True:       1         1         2         0.6       T = 100
GLM:        0.69      0.70                          0.23      0.25
            (0.38)    (0.27)                        (0.02)    (0.03)
GLMM:       1.09      1.07      1.99      0.51      0.58      0.47      1.35      0.26
            (0.58)    (0.37)    (0.39)    (0.20)    (0.33)    (0.27)    (1.15)    (0.18)

True:       1         1         2         -0.8      T = 100
GLM:        0.65      0.61                          0.22      0.24
            (0.21)    (0.26)                        (0.01)    (0.03)
GLMM:       1.04      0.96      2.00      -0.75     0.42      0.51      1.04      0.16
            (0.29)    (0.34)    (0.53)    (0.13)    (0.32)    (0.99)    (1.04)    (0.13)

Average and standard deviation (in parentheses) of fixed effects, variance
components and their standard error estimates from a GLM and a GLMM with latent
AR(1) process. The two models were fitted to each of 100 generated binary time
series of length T = 400 and T = 100.

Since our methods are general enough to handle unequally spaced data, we
repeated the first simulation with a time series of T = 400 binary observations, but
now randomly deleted 10% of the observations to create random gaps in the series.
We left all parameters and the model for the conditional odds unchanged, except
that we now assume that random effects follow the latent autoregressive
process $u_{t+1} = \rho^{d_t} u_t + \epsilon_t$, where the $\epsilon_t$ are independent
$N(0, \sigma^2[1-\rho^{2d_t}])$ and $d_t$ is the
difference (in the units of measurement) between the time points associated with
the observations at times $t$ and $t+1$. For example, the first series we generated had
1 gap of length three (i.e., $d_t = 4$ for one $t$), 4 gaps of length two (i.e., $d_t = 3$ for 4
$t$'s) and 29 gaps of length one (i.e., $d_t = 2$ for 29 $t$'s). For all other $t$'s, $d_t = 1$, i.e.,
they are successive observations and the difference between two of them is one unit
of measurement.

Table 3-2: Simulation study for modeling unequally spaced binary time series.

            α         β         σ         ρ         s.e.(α)   s.e.(β)   s.e.(σ)   s.e.(ρ)

True:       1         1         2         0.8       T = 360, unequally spaced
GLM:        0.61      0.62                          0.12      0.12
            (0.19)    (0.11)                        (0.00)    (0.01)
GLMM:       1.03      1.00      2.07      0.75      0.38      0.28      0.71      0.11
            (0.29)    (0.16)    (0.25)    (0.06)    (0.19)    (0.28)    (0.80)    (0.18)

Average and standard deviation (in parentheses) of fixed effects, variance
components and their standard error estimates from a GLM and a GLMM with latent
autoregressive random effects accounting for unequally spaced observations. The two
models were fitted to each of 100 generated binary time series of length T = 360,
with random gaps of random length between observations.

Simulation  results  are  shown  in  Table  3-2  and  reveal  that  our  proposed 
methods  and  algorithm  also  work  fine  for  an  unequally  spaced  binary  time  series. 
All  true  parameters  are  included  in  confidence  intervals  based  on  the  average  of  the 
estimated  parameters  from  100  replicated  series  and  its  standard  deviation  (shown 
in  parentheses  in  Table  3-2). 
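
The way gaps enter the latent process can be illustrated with a short Python sketch. It assumes the innovation variance $\sigma^2(1-\rho^{2d_t})$, which keeps $\mathrm{var}(u_t) = \sigma^2$ across gaps of any length. The function names, the seeds and the particular deletion pattern are hypothetical.

```python
import math
import random

def gaps_from_times(times):
    """d_t: difference between successive observed time points."""
    return [b - a for a, b in zip(times[:-1], times[1:])]

def simulate_unequal_ar1(times, sigma, rho, seed=0):
    """Latent process u_{t+1} = rho^{d_t} u_t + eps_t observed at `times`,
    with eps_t ~ N(0, sigma^2 (1 - rho^(2 d_t)))."""
    rng = random.Random(seed)
    u = [rng.gauss(0.0, sigma)]
    for d in gaps_from_times(times):
        sd = sigma * math.sqrt(1.0 - rho ** (2 * d))
        u.append(rho ** d * u[-1] + rng.gauss(0.0, sd))
    return u

# Delete 10% of 400 equally spaced time points at random (hypothetical seed).
rng = random.Random(1)
deleted = set(rng.sample(range(1, 401), 40))
times = [t for t in range(1, 401) if t not in deleted]

d = gaps_from_times(times)
u = simulate_unequal_ar1(times, sigma=2.0, rho=0.8)
```

A gap of length one (one deleted observation) shows up as $d_t = 2$ in `d`, a gap of length two as $d_t = 3$, and so on.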


CHAPTER  4 
MODEL PROPERTIES FOR NORMAL, POISSON AND BINOMIAL OBSERVATIONS

So far we have discussed models for discrete valued time series data in a very
broad manner. In Section 2, we developed the likelihood for our models based on
generic distributions $f(y \mid u)$ for observations $y$ and $g(u)$ for random effects $u$ and
presented an algorithm for finding maximum likelihood estimates. Section 3 looked
at two special cases of random effects distributions useful for describing temporal or
spatial dependencies. In this chapter we make specific distributional assumptions
about the observations and develop some theory underlying the models we propose.
We will pay special attention to data in the form of a single (sometimes considered
generic) time series $Y = (Y_1,\ldots,Y_T)$ and derive marginal properties implied by
the conditional model formulation. Multiple, independent time series $Y_1,\ldots,Y_n$
can result from replication of the original time series or from stratification of the
sampled population such as in the example about homosexual relationships. All
derivations given below for a generic time series $Y$ still hold for the $i$-th series
$Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{iT})$, provided the same latent process $\{u_t\}$ is assumed to
underlie each one of them.

An  important  characteristic  of  any  time  series  model  is  its  implied  serial 
dependency  structure.  In  the  case  of  normal  theory  time  series  models,  this  is 
specified  by  the  autocorrelation  function.  In  Section  4.1  we  derive  the  implied 
marginal  autocorrelation  function  for  GLMMs  with  normal  random  components 
and  either  an  equal  correlation  or  autoregressive  assumption  for  the  random  effects. 
With  these  assumptions,  our  models  are  special  cases  of  linear  mixed  models 
discussed  for  instance  in  Diggle  et  al.  (2002).  In  Sections  4.2  and  4.3  we  explore 
marginal properties of GLMMs with Poisson and binomial random components
that  are  induced  by  assuming  equally  correlated  or  autoregressive  random  effects. 
In  Chapter  5,  these  model  properties  such  as  the  implied  autocorrelation  function 
are  then  compared  to  empirical  counterparts  based  on  the  observed  data  to 
evaluate  the  proposed  model. 

Section 2.1 mentioned that parameters in GLMMs have a conditional
interpretation, controlling for the random effects. Correlated random effects vary over
time and parameter interpretation is different from having just one common level of
a random effect, as in many standard random intercepts GLMMs. For each of the
models presented here, we discuss parameter interpretation in a separate section.
4.1      Analysis for a Time Series of Normal Observations

Suppose that conditional on time specific normal random effects $\{u_t\}$,
observations $\{Y_t\}$ are independent $N(\mu_t + u_t, \tau^2)$. The marginal likelihood for this model is
tractable, because marginally the joint distribution of $\{Y_t\}$ is multivariate normal
with mean $\mu = (\mu_1,\ldots,\mu_T)'$ and covariance matrix $\Sigma_u + \tau^2 I$, where $\Sigma_u$ is the
covariance matrix of the joint distribution of $\{u_t\}$. With the usual assumption that
$\mathrm{var}(u_t) = \sigma^2$, the marginal variance of $Y_t$ is given by

$$\mathrm{var}(Y_t) = \tau^2 + \sigma^2$$

and the marginal correlation function $\rho(t,t^*)$ for the case of equally correlated
random effects (cf. Section 3.2) has form

$$\rho(t,t^*) = \mathrm{corr}(Y_t, Y_{t^*}) = \frac{\sigma^2}{\tau^2+\sigma^2}\,\rho, \qquad (4.1)$$

while for the case of autocorrelated random effects (cf. Section 3.3), it has form

$$\rho(t,t^*) = \mathrm{corr}(Y_t, Y_{t^*}) = \frac{\sigma^2}{\tau^2+\sigma^2}\,\rho^{\sum_{k=t}^{t^*-1} d_k}. \qquad (4.2)$$

If the distances between time points are equal, then (4.2) is more conveniently
written in terms of the lag $h$ between observations as

$$\rho(h) = \mathrm{corr}(Y_t, Y_{t+h}) = \frac{\sigma^2}{\tau^2+\sigma^2}\,\rho^{h}.$$

For both cases, note that the autocorrelations (4.1) and (4.2) are smaller than
the corresponding ones assumed for the underlying latent process $\{u_t\}$ by a factor
of $\sigma^2/(\tau^2+\sigma^2)$. For equally correlated random effects, the marginal covariance matrix
has form $\tau^2 I + \sigma^2[(1-\rho)I + \rho J]$, implying equal marginal correlations between
any two members $Y_t$ and $Y_{t^*}$ of $\{Y_t\}$. (This can also be seen from (4.1), where the
autocorrelations do not depend on $t$ or $t^*$.) Diggle et al. (2002, Sec. 5.2.2) call this
a model with serial correlation plus measurement error.
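
The attenuation factor $\sigma^2/(\tau^2+\sigma^2)$ is easy to evaluate numerically. The following Python sketch, with a hypothetical function name, returns the marginal correlation (4.1) when no lag is supplied and the equally spaced, lag-$h$ version of (4.2) otherwise.

```python
def normal_marginal_corr(sigma2, tau2, rho, h=None):
    """Marginal correlation of Y_t and Y_{t*} in the normal model.

    h is None : equally correlated random effects, equation (4.1)
    h = 1,2,..: autoregressive random effects at lag h, equation (4.2)
    """
    shrink = sigma2 / (tau2 + sigma2)   # attenuation relative to the latent process
    return shrink * (rho if h is None else rho ** h)

# With sigma^2 = 4 and tau^2 = 1 the latent correlations are shrunk by 0.8.
equal = normal_marginal_corr(4.0, 1.0, 0.5)        # 0.8 * 0.5
lag2 = normal_marginal_corr(4.0, 1.0, 0.8, h=2)    # 0.8 * 0.8^2
```

As $\tau^2 \to 0$ the factor tends to one and the marginal correlations coincide with those of the latent process.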

Similar properties can be observed in the case of autocorrelated random
effects: The basic structure of correlations decaying in absolute value with increasing
distances between observation times (as measured by $\sum_k d_k$ or $h$) is preserved
marginally. However, the first-order Markov property of the underlying
autoregressive process is not preserved in the marginal distribution of $\{Y_t\}$, which can
be proved by calculating conditional distributions. For instance, for three ($T = 3$)
equidistant time points, the conditional mean of $Y_3$ given $Y_1 = y_1$ and $Y_2 = y_2$ is
equal to

$$\mu_3 + \frac{\sigma^2\rho}{(\sigma^2+\tau^2)^2 - \sigma^4\rho^2}\left\{\tau^2\rho\,(y_1-\mu_1) + \left[\sigma^2(1-\rho^2)+\tau^2\right](y_2-\mu_2)\right\}$$

and depends on $y_1$.
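
The loss of the Markov property can be checked numerically. The Python sketch below (function name hypothetical) computes the regression coefficients of $Y_3$ on $Y_1$ and $Y_2$ from the marginal covariance matrix $\Sigma_u + \tau^2 I$ for three equidistant time points, using the closed-form inverse of a $2 \times 2$ matrix.

```python
def cond_mean_coefs(sigma2, tau2, rho):
    """Coefficients (b1, b2) of (y1 - mu1) and (y2 - mu2) in E[Y3 | y1, y2]
    for three equidistant time points with AR(1) random effects."""
    a = sigma2 + tau2                  # var(Y_t)
    c = sigma2 * rho                   # cov(Y_1, Y_2)
    det = a * a - c * c                # determinant of the covariance of (Y_1, Y_2)
    c31 = sigma2 * rho ** 2            # cov(Y_3, Y_1)
    c32 = sigma2 * rho                 # cov(Y_3, Y_2)
    b1 = (c31 * a - c32 * c) / det
    b2 = (c32 * a - c31 * c) / det
    return b1, b2

b1_noise, b2_noise = cond_mean_coefs(4.0, 1.0, 0.8)   # tau^2 > 0: b1 is nonzero
b1_pure, b2_pure = cond_mean_coefs(4.0, 0.0, 0.8)     # tau^2 = 0: b1 vanishes
```

With $\tau^2 = 0$ the observations coincide with the latent AR(1) process, so the coefficient of $y_1$ is zero and that of $y_2$ equals $\rho$. Any $\tau^2 > 0$ (with $\rho \neq 0$) gives $y_1$ a nonzero weight.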

It should be noted that in the case of independent random effects with
$\Sigma_u = \sigma^2 I$, marginally the $Y_t$'s are also independent, but with overdispersed
variances $\tau^2+\sigma^2$ relative to their conditional distribution. This case can be seen
as a special case of the equally correlated model and the autoregressive model when
$\rho = 0$.
The traditional assumption in random intercepts models is a
common random effect $u = u_t$ for all time points $t$. I.e., conditional on a $N(0,\sigma^2)$
random effect $u$, $Y_t$ is $N(\mu_t + u, \tau^2)$ for $t = 1,\ldots,T$. For this case, the marginal
covariance matrix has form $\tau^2 I + \sigma^2 J$. This can be derived directly or inferred
from the marginal correlation expressions (4.1) and (4.2) by setting $\rho = 1$, implying
perfect correlation among the $\{u_t\}$. Hence, the random intercepts model is a
special case of the equally correlated or autoregressive model when $\rho = 1$. It implies
a constant (exchangeable) marginal correlation of $\sigma^2/(\tau^2+\sigma^2)$ between any two
observations $Y_t$ and $Y_{t^*}$.
4.1.1      Analysis  via  Linear  Mixed  Models 

In a GLMM, we try to provide some structure for the unknown mean
component $\mu_t$ by using covariates $x_t$. Let $x_t'\beta$ be a linear predictor for $\mu_t$, with $\beta$
denoting a fixed effects parameter vector for the covariates $x_t$. Using an
identity link, the series $\{Y_t\}$ then follows a GLMM with conditional mean function
$E[Y_t \mid u_t] = x_t'\beta + u_t$. The model can be written as $Y_t = x_t'\beta + u_t + \epsilon_t$, where
$\epsilon_t \sim N(0,\tau^2)$ and independent of $u_t$. Then, the models discussed here are special
cases of mixed effects models (Verbeke and Molenberghs, 2000) with general matrix
form

$$Y = X\beta + Zu + \epsilon.$$

In our case, $Y = (Y_1,\ldots,Y_T)'$ is the time series vector and $X = (x_1',\ldots,x_T')'$
is the overall design matrix with associated parameter $\beta$. The design matrix $Z$
for the random effects $u' = (u_1,\ldots,u_T)$ simplifies to the identity matrix $I_T$. The
distributional assumption on the random effects is $u \sim N(0,\Sigma_u)$ and they are
independent from the $N(0,\tau^2 I)$-distributed errors $\epsilon$. Exploiting this relationship,
software for fitting models of this kind (i.e., correlated normal data with structured
covariance matrix of form $\mathrm{var}(Y) = Z\Sigma_u Z' + \tau^2 I$) is readily available, for instance
in the form of the SAS procedure proc mixed, where the equal correlation structure
and the autoregressive structure are only two out of many possible choices for the
covariance matrix $\Sigma_u$ for the random effects distribution.
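
To make the structured covariance matrix concrete, the following Python sketch builds $\mathrm{var}(Y) = \Sigma_u + \tau^2 I$ (recall $Z = I_T$) for equally spaced autoregressive random effects, where $\Sigma_u$ has entries $\sigma^2\rho^{|s-t|}$. The function name and the plain nested-list representation are illustrative choices, not part of any fitting software.

```python
def marginal_cov_ar1(T, sigma2, tau2, rho):
    """var(Y) = Sigma_u + tau^2 I with Sigma_u[s][t] = sigma^2 * rho^|s-t|."""
    return [
        [sigma2 * rho ** abs(s - t) + (tau2 if s == t else 0.0) for t in range(T)]
        for s in range(T)
    ]

V = marginal_cov_ar1(3, sigma2=4.0, tau2=1.0, rho=0.5)
# V is [[5.0, 2.0, 1.0], [2.0, 5.0, 2.0], [1.0, 2.0, 5.0]]
```

Dividing the off-diagonal entries by the common variance $\tau^2+\sigma^2$ gives the lag-$h$ correlations $\frac{\sigma^2}{\tau^2+\sigma^2}\rho^h$ of Section 4.1.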

Mixed effects models are very popular for the regression analysis of shorter
time series, like growth curve models or data from longitudinal studies. In
Section 5.1, we illustrate an application by analyzing the motivating example of
Section 3.1 about attitudes towards homosexual relationships, based on a normal
approximation to the log odds.
4.1.2     Parameter  Interpretation 

Parameters in normal time series models retain their interpretation when
averaging over the random effects distribution. The interpretation of $\beta$ as the
change in the mean for a change in the covariates is valid conditional on random
effects and also marginally. The random effects parameters only contribute to the
variance-covariance structure of the marginal distribution, inducing overdispersion
and correlation relative to the conditional assumptions.

4.2     Analysis  for  a  Time  Series  of  Counts 
Suppose now that conditional on time specific normal random effects $\{u_t\}$,
observations $\{Y_t\}$ are independent counts, which we model as Poisson random
variables with mean $\mu_t$. Using a log link, explanatory variables $x_t$ and correlated
random effects $\{u_t\}$, we specify the conditional mean structure of a Poisson GLMM
as

$$\log(\mu_t) = x_t'\beta + u_t, \quad t = 1,\ldots,T. \qquad (4.3)$$

The correlation in the random effects allows the log-means to be correlated over
time or space. The marginal likelihood corresponding to this model is given by

$$L(\beta,\psi;y) \propto \int \prod_{t=1}^{T} \mu_t^{y_t}\exp\{-\mu_t\}\, g(u;\psi)\,du
= \int \exp\left\{\sum_{t=1}^{T}\left[y_t(x_t'\beta+u_t) - \exp\{x_t'\beta+u_t\}\right]\right\} g(u;\psi)\,du,$$
where $g(u;\psi)$ is one of the random effects distributions of Chapter 3. In that case,
the integral is not tractable and numerical methods such as the MCEM algorithm
of Section 2.3 must be used to find maximum likelihood estimates for $\beta$ and $\psi$. For
this, the function $Q_m$ defined in (2.16) has form

$$Q_m = \frac{1}{m}\sum_{j=1}^{m}\sum_{t=1}^{T}\left[y_t\left(x_t'\beta + u_t^{(j)}\right) - \exp\{x_t'\beta + u_t^{(j)}\}\right],$$

where $u_t^{(j)}$ is the $t$-th element of the $j$-th generated sample $u^{(j)}$ from the posterior
distribution $h(u \mid y; \beta^{(k-1)}, \psi^{(k-1)})$. Note that here we discuss only the case of a
generic time series $\{Y_t\}$ with no replication, hence $n = 1$ (i.e., index $i$ is redundant),
and $n_i = T$ in the general form presented in (2.16). If replications are available
or in the case where two time series differ in the fixed effects part but not in the
random effects (e.g., have the same underlying latent process), then one simply
needs to include the sum over the replicates as indicated in (2.16). Choosing one
of the correlated random effects distributions of Chapter 3, the Gibbs sampling
algorithms developed in Sections 3.4.1 or 3.4.2 can be used to generate the sample
from $h(u \mid y)$, with $f(y_t \mid u_t)$ having the form of a Poisson density with mean $\mu_t$.
4.2.1     Marginal  Model  Implied  by  the  Poisson  GLMM 
As with the normal GLMMs before, marginal first and second moments can
be obtained by integrating over the random effects distribution, although here
the complete marginal distribution of $Y_t$ is not tractable as it is in the normal
case. The random effects appearing in model (4.3) imply that the conditional
log-means $\{\log(\mu_t)\}$ are random quantities. Assuming that random effects $\{u_t\}$ are
normal with zero mean and variance $\mathrm{var}(u_t) = \sigma^2$, they have expectations $\{x_t'\beta\}$
and variance $\sigma^2$. For two distinct time points $t$ and $t^*$, their correlation under an
independence, equal correlation or autocorrelation assumption on the random
effects is given by 0, $\rho$ or $\rho^{\sum_{k=t}^{t^*-1} d_k}$, respectively. (Remember that $d_k$ denoted the
time difference between two successive observations $y_k$ and $y_{k+1}$.) On the original
scale, the means have expectation, variance and correlation given by

$$E[\mu_t] = \exp\{x_t'\beta + \sigma^2/2\}$$

$$\mathrm{var}(\mu_t) = \exp\{2(x_t'\beta + \sigma^2/2)\}\left(e^{\sigma^2} - 1\right)$$

$$\mathrm{corr}(\mu_t,\mu_{t^*}) = \frac{e^{\mathrm{cov}(u_t,u_{t^*})} - 1}{e^{\sigma^2} - 1}.$$

Plugging in $\mathrm{cov}(u_t,u_{t^*}) = 0$, $\sigma^2\rho$ or $\sigma^2\rho^{\sum_{k=t}^{t^*-1} d_k}$ yields the marginal correlations
among means when assuming independent, equally correlated or autoregressive
random effects, respectively.

4.2.1.1     Marginal distribution of $Y_t$

Now let's turn to the marginal distribution of $Y_t$ itself, for which we can only
derive moments. The marginal mean and variance of $Y_t$ are given by:

$$E[Y_t] = E[\mu_t] = \exp\{x_t'\beta + \sigma^2/2\} \qquad (4.4)$$

$$\mathrm{var}(Y_t) = E[\mu_t] + \mathrm{var}(\mu_t) = E[Y_t]\left[1 + E[Y_t]\left(e^{\sigma^2} - 1\right)\right].$$

Hence, the log of the marginal mean still follows a linear model with fixed effects
parameters $\beta$, but with an additional offset $\sigma^2/2$ to the intercept term. (This is not
particular to the Poisson assumption, but is true for any loglinear random effects
model of form (4.3) with more general random effects structure $z_t'u_t$, see Problem
13.42 in Agresti, 2002.) The marginal distribution of $Y_t$ is not Poisson, since the
variance exceeds the mean by a factor of $[1 + E[Y_t](e^{\sigma^2} - 1)]$. The marginal variance
is a quadratic function of the marginal mean.
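
As a small numerical illustration, the two moment formulas can be coded directly. The Python sketch below uses a hypothetical function name and takes the linear predictor $\eta_t = x_t'\beta$ and $\sigma^2$ as inputs.

```python
import math

def poisson_glmm_marginal(eta, sigma2):
    """Marginal mean (4.4) and variance of Y_t for linear predictor eta = x_t' beta."""
    mean = math.exp(eta + sigma2 / 2.0)
    var = mean * (1.0 + mean * (math.exp(sigma2) - 1.0))
    return mean, var

mean0, var0 = poisson_glmm_marginal(0.0, 0.0)   # no random effect: Poisson with mean 1
mean1, var1 = poisson_glmm_marginal(0.0, 1.0)   # sigma^2 = 1: overdispersed
```

With $\sigma^2 = 0$ mean and variance agree, as for a Poisson variable. For any $\sigma^2 > 0$ the variance exceeds the mean, and the excess grows quadratically in the mean.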

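For two distinct time points $t$ and $t^*$, the marginal covariance between
observations $Y_t$ and $Y_{t^*}$ is given by

$$\mathrm{cov}(Y_t,Y_{t^*}) = \mathrm{cov}(\mu_t,\mu_{t^*})
= E[Y_t]E[Y_{t^*}]\left(e^{\mathrm{cov}(u_t,u_{t^*})} - 1\right) \qquad (4.5)$$

$$\phantom{\mathrm{cov}(Y_t,Y_{t^*})} = \exp\{(x_t + x_{t^*})'\beta + \sigma^2\}\left(e^{\mathrm{cov}(u_t,u_{t^*})} - 1\right).$$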
In the case where random effects $\{u_t\}$ are assumed independent, the marginal
covariance is zero. In longitudinal studies, usually each replicated time series has
its own univariate random effect attached to it. For such a time series $\{Y_t\}$, assume
a single common random effect $u \sim N(0,\sigma^2)$ shared by all observations in the
series. I.e., in the notation used above, $u = u_t$ for all $t$ and model (4.3) has form
$\log(\mu_t) = x_t'\beta + u$. Then, $\mathrm{cov}(u_t,u_{t^*}) = \mathrm{var}(u) = \sigma^2$ and the marginal correlation
between any two members of the time series is given by

$$\mathrm{corr}(Y_t,Y_{t^*}) = \frac{\left(E[Y_t]E[Y_{t^*}]\right)^{1/2}\left(e^{\sigma^2} - 1\right)}{\left[1 + E[Y_t](e^{\sigma^2} - 1)\right]^{1/2}\left[1 + E[Y_{t^*}](e^{\sigma^2} - 1)\right]^{1/2}}. \qquad (4.6)$$

This is the exchangeable correlation structure implied by a random-intercepts
Poisson GLMM (see, e.g., Agresti 2002, pages 564 and 575). In Section 2 we
motivated and proposed correlated random effects $\{u_t\}$ to facilitate other correlation
structures. We will now derive marginal correlation properties for a time series of
counts, based on our conditional Poisson GLMM approach using equally correlated
or autoregressive random effects. This is easily done by plugging in for $\mathrm{cov}(u_t,u_{t^*})$
in (4.5) above.

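The equal correlation assumption $\mathrm{cov}(u_t,u_{t^*}) = \sigma^2\rho$ leads to the marginal
structure

$$\mathrm{corr}(Y_t,Y_{t^*}) = \frac{\left(E[Y_t]E[Y_{t^*}]\right)^{1/2}\left(e^{\sigma^2\rho} - 1\right)}{\left[1 + E[Y_t](e^{\sigma^2} - 1)\right]^{1/2}\left[1 + E[Y_{t^*}](e^{\sigma^2} - 1)\right]^{1/2}}, \qquad (4.7)$$

still implying equal (but possibly negative) correlations. The autoregressive random
effects approach with $\mathrm{cov}(u_t,u_{t^*}) = \sigma^2\rho^{\sum_{k=t}^{t^*-1} d_k}$ leads to a decaying (with time)
correlation function

$$\mathrm{corr}(Y_t,Y_{t^*}) = \frac{\left(E[Y_t]E[Y_{t^*}]\right)^{1/2}\left(e^{\sigma^2\rho^{\sum_{k=t}^{t^*-1} d_k}} - 1\right)}{\left[1 + E[Y_t](e^{\sigma^2} - 1)\right]^{1/2}\left[1 + E[Y_{t^*}](e^{\sigma^2} - 1)\right]^{1/2}}.$$

In the case of equally spaced observations ($d_k = 1$ for all $k$), this is more
conveniently written in terms of the lag $h$ in between two observations:

$$\mathrm{corr}(Y_t,Y_{t+h}) = \frac{\left(E[Y_t]E[Y_{t+h}]\right)^{1/2}\left(e^{\sigma^2\rho^{h}} - 1\right)}{\left[1 + E[Y_t](e^{\sigma^2} - 1)\right]^{1/2}\left[1 + E[Y_{t+h}](e^{\sigma^2} - 1)\right]^{1/2}}. \qquad (4.8)$$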

Note that if $\rho = 1$, i.e., perfect correlation between random effects, all correlation
structures reduce to the random intercept model with correlation structure (4.6).
However, with $|\rho| < 1$ and $h \to \infty$, (4.8) accommodates decaying correlations and
with $\rho < 0$, (4.7) accommodates negative correlation. In Section 5.3, we will fit a
Poisson GLMM to a time series of counts and use the marginal properties derived
here to assess and interpret the regression model.
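
The behaviour of these correlation functions is easy to explore numerically. The Python sketch below (function names hypothetical) evaluates the marginal correlation of two counts given their marginal means, $\sigma^2$ and the covariance of the two random effects. The lag-$h$ autoregressive case of (4.8) is obtained by plugging in $\sigma^2\rho^h$.

```python
import math

def poisson_marginal_corr(m1, m2, sigma2, cov_u):
    """Marginal corr(Y_t, Y_t*) given marginal means m1, m2, var(u_t) = sigma2
    and cov(u_t, u_t*) = cov_u, obtained from (4.5) and the marginal variances."""
    num = math.sqrt(m1 * m2) * (math.exp(cov_u) - 1.0)
    den = math.sqrt((1.0 + m1 * (math.exp(sigma2) - 1.0)) *
                    (1.0 + m2 * (math.exp(sigma2) - 1.0)))
    return num / den

def ar1_lag_corr(m1, m2, sigma2, rho, h):
    """Equation (4.8): autoregressive random effects, equally spaced, lag h."""
    return poisson_marginal_corr(m1, m2, sigma2, sigma2 * rho ** h)

decay = [ar1_lag_corr(5.0, 5.0, 1.0, 0.8, h) for h in (1, 2, 10)]
negative = poisson_marginal_corr(5.0, 5.0, 1.0, 1.0 * (-0.5))   # (4.7) with rho = -0.5
```

The values in `decay` decrease towards zero with increasing lag, while `negative` illustrates that the equal correlation structure (4.7) yields a negative marginal correlation when $\rho < 0$.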

4.2.1.2     Negative  Binomial  GLMMs 

An  alternative  to  the  Poisson  assumption  as  the  conditional  distribution 
for  the  counts  is  to  use  a  negative  binomial  distribution.  The  negative  binomial 
distribution  per  se  already  allows  for  overdispersion  relative  to  the  mean.  A  second 
source  of  overdispersion  is  then  introduced  by  regarding  the  (log-)  mean  of  a 
negative  binomial  random  variable  as  a  normal  mixture.  Correlated  random  effects 
allow  these  means  to  be  connected  over  time.  Booth  et  al.  (2004)  look  at  negative 
binomial  GLMMs  with  independent  (over  time)  random  effects.  The  anchovy 
larvae  data  analyzed  there  is  a  time  series  of  correlated  counts  and  autoregressive 
random  effects  seem  an  appropriate  alternative  to  the  independent  ones  used  by 
Booth  et  al. 

Using the parametrization of the negative binomial distribution as discussed in
Sec. 13.4 of Agresti (2002), let $Y_t$ be negative binomial with mean $\mu_t$ and variance
$\mu_t + \mu_t^2/k$, conditional on random effects $u_t$. (For fixed $k$, the negative binomial
distribution is a member of the exponential family of distributions.) We consider
cases where the dispersion parameter $k$ is the same for all observations. As in the
Poisson GLMM presented before, we propose the following loglinear model for the
conditional mean of $Y_t$ given $u_t$:

$$\log(E[Y_t \mid u_t]) = \log(\mu_t) = x_t'\beta + u_t, \quad t = 1,\ldots,T,$$

where $\{u_t\}$ follows one of the random processes discussed in Chapter 3. Marginally,
this leads to the same expectations, variances and covariances for the conditional
log-means and means as discussed in the Poisson model above. Furthermore, since
(4.4) holds for any loglinear model, it also holds for the negative binomial loglinear
model and the marginal means of the Poisson GLMM and negative binomial
GLMM coincide. However, the marginal variance under the negative binomial
assumption is given by

$$\mathrm{var}(Y_t) = E[\mu_t + \mu_t^2/k] + \mathrm{var}(\mu_t) = E[Y_t]\left[1 + E[Y_t]\left(\frac{k+1}{k}e^{\sigma^2} - 1\right)\right],$$

which for $k \to \infty$ approaches the variance of the Poisson GLMM. The difference
between the variance under a negative binomial assumption and a Poisson
assumption is $\frac{e^{\sigma^2}}{k}(E[Y_t])^2$. Similarly, for each one of the random effects structures
discussed under the Poisson GLMM, the formulas for the marginal correlations
presented for the Poisson GLMM hold true when $e^{\sigma^2}$ in the denominator of each
equation is replaced by $\frac{k+1}{k}e^{\sigma^2}$. When these implied marginal properties are
more plausible than the corresponding ones from a Poisson GLMM, as judged for
instance by a comparison of the approximated maximized likelihoods or by a
comparison of empirical estimates to model based estimates, then the negative binomial
GLMM is a relevant alternative.
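
The relation between the two marginal variances can be verified with a few lines of Python. The function names below are hypothetical. Both functions take the common marginal mean $E[Y_t]$, $\sigma^2$ and, for the negative binomial case, the dispersion parameter $k$.

```python
import math

def poisson_glmm_var(mean, sigma2):
    """Marginal variance under the Poisson GLMM."""
    return mean * (1.0 + mean * (math.exp(sigma2) - 1.0))

def negbin_glmm_var(mean, sigma2, k):
    """Marginal variance under the negative binomial GLMM with dispersion k."""
    return mean * (1.0 + mean * ((k + 1.0) / k * math.exp(sigma2) - 1.0))

m, s2, k = 3.0, 0.5, 2.0
diff = negbin_glmm_var(m, s2, k) - poisson_glmm_var(m, s2)
# diff agrees with exp(s2) / k * m**2 up to rounding error
```

Letting `k` grow large makes `negbin_glmm_var` approach `poisson_glmm_var`, in line with the limiting statement above.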

Note that, as with any GLMM, the negative binomial GLMM results from the
hierarchy

$$Y_t \mid u_t \sim \text{neg. bin.}(\mu_t, k), \quad t = 1,\ldots,T$$
$$(u_1,\ldots,u_T) \sim N(0,\Sigma_u),$$

and the marginal correlation between the counts $\{Y_t\}$ arises because of the
correlations in the underlying random effects $u_t$ which appear in the model for the
log-means.
4.2.2     Parameter  Interpretation 

As long as $\{u_t\}$ is a mean-stationary random process (we assume a mean of
zero throughout), it follows from (4.4) that all parameters except the intercept
have equal interpretations conditionally on random effects and marginally. For a
Gaussian random process with variance $\sigma^2$, the intercept itself is offset by
$\sigma^2/2$. Hence, all parameters except the intercept can be interpreted as effects
on the conditional or marginal log-mean. In particular, for any member $\beta_j$ of $\beta$,
$e^{\beta_j}$ is the ratio of two conditional or marginal means after a one unit change in the
covariate associated with $\beta_j$.
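
A quick numerical check of this interpretation, written as a Python sketch with hypothetical names and made-up parameter values: the offset $\sigma^2/2$ cancels in the ratio of two marginal means, so a one unit change in a covariate multiplies the marginal mean by $e^{\beta_j}$.

```python
import math

def marginal_mean(x, beta, sigma2):
    """Marginal mean exp{x' beta + sigma^2 / 2} from (4.4)."""
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    return math.exp(eta + sigma2 / 2.0)

beta = [0.2, 0.5]        # illustrative values: intercept and one slope
sigma2 = 1.5             # illustrative variance of the random effects
ratio = marginal_mean([1.0, 3.0], beta, sigma2) / marginal_mean([1.0, 2.0], beta, sigma2)
# ratio agrees with math.exp(0.5) up to rounding error, for any sigma2
```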

4.3     Analysis  for  a  Time  Series  of  Binomial  or  Binary  Observations 

Suppose that conditional on a time specific normal random effect $u_t$,
observations $\{Y_{st}\}_{s=1}^{n_t}$ are independent and identically distributed binary random variables with
conditional success probability $\pi_t(u_t) = P(Y_{st} = 1 \mid u_t)$, $t = 1,\ldots,T$. Consequently,
the sum $Y_t = \sum_{s=1}^{n_t} Y_{st}$ has a conditional binomial$(n_t, \pi_t(u_t))$ distribution.
Furthermore, given random effects $u_t$ and $u_{t^*}$ at two different time points $t$ and $t^*$, $Y_t$
and $Y_{t^*}$ are conditionally independent. Using a logit link, time-specific explanatory
variables $\{x_t\}$ and correlated random effects $\{u_t\}$, we specify the conditional mean
structure of a binomial GLMM as

$$\mathrm{logit}(\pi_t(u_t)) = \log\left(\frac{\pi_t(u_t)}{1-\pi_t(u_t)}\right) = x_t'\beta + u_t, \quad t = 1,\ldots,T. \qquad (4.9)$$

Correlated random effects allow correlation of conditional log odds over different
time points or locations. For observed data $y = (y_1,\ldots,y_T)$, the marginal
likelihood corresponding to this model is given by

$$L(\beta,\psi;y) \propto \int \prod_{t=1}^{T} [\pi_t(u_t)]^{y_t}[1-\pi_t(u_t)]^{n_t-y_t}\, g(u;\psi)\,du
= \int \exp\left\{\sum_{t=1}^{T} y_t(x_t'\beta+u_t)\right\}\prod_{t=1}^{T}\left(1+\exp\{x_t'\beta+u_t\}\right)^{-n_t} g(u;\psi)\,du,$$

where $g(u;\psi)$ is one of the random effects distributions of Chapter 3. The integral
is not tractable and numerical methods such as the MCEM algorithm of Section 2.3
must be used to find maximum likelihood estimates for $\beta$ and $\psi$. The function $Q_m$
defined in (2.16) now has form

$$Q_m = \frac{1}{m}\sum_{j=1}^{m}\sum_{t=1}^{T}\left[y_t\left(x_t'\beta + u_t^{(j)}\right) - n_t\log\left(1+\exp\{x_t'\beta + u_t^{(j)}\}\right)\right],$$

where $u_t^{(j)}$ is the $t$-th element of the $j$-th generated sample $u^{(j)}$ from the posterior
distribution $h(u \mid y)$. As before, note that here we discuss only the case of a generic
time series $\{Y_t\}$ with no replication, hence $n = 1$ (i.e., index $i$ is redundant), and
$n_i = T$ in the general form presented in (2.16). Again, if replications are available
or in the case where two time series differ in the fixed effects part but not in the
random effects (e.g., have the same underlying latent process), then one simply
needs to include the sum over the replicates as indicated in (2.16). An example
where we assumed that two series differ in their fixed effects parameters but share
the same underlying latent process $\{u_t\}$ is the motivating example of Section 3.1,
with one binomial time series each for white and black respondents. Choosing one
of the correlated random effects distributions of Chapter 3, the Gibbs sampling
algorithms developed in Sections 3.4.1 or 3.4.2 can be used to generate the sample
from $h(u \mid y)$, with $f(y_t \mid u_t)$ having the form of a binomial$(n_t, \pi_t(u_t))$ density.
4.3.1      Marginal Model Implied by the Binomial GLMM

Marginal properties are harder to derive than in the normal or Poisson case,
because the conditional mean $\pi_t(u_t)$ is not a linear or exponential function of the
random effects.

Assuming zero-mean random effects $\{u_t\}$ with variance $\mathrm{var}(u_t) = \sigma^2$, the
conditional log odds $\{\mathrm{logit}(\pi_t(u_t))\}$ have means $\{x_t'\beta\}$ and variance $\sigma^2$. For
two distinct time points $t$ and $t^*$, the correlation between the conditional log
odds at times $t$ and $t^*$ under independence, equal correlation or autocorrelation
assumptions on the random effects is given by 0, $\rho$ or $\rho^{\sum_{k=t}^{t^*-1} d_k}$, respectively. We
will refer to $E[\mathrm{logit}(\pi_t(u_t))] = x_t'\beta$ as the unconditional or expected log odds, ones
that do not depend on the random effects. It is perhaps more natural to investigate
the unconditional or expected odds of success, $E[\pi_t(u_t)/(1-\pi_t(u_t))]$, since
interpretation is on the natural scale. They are given by $\exp\{x_t'\beta + \sigma^2/2\}$ and any
member of $\exp\{\beta\}$ can be interpreted as the change in the expected (i.e., averaged
over random effects) odds of success for a unit change in the corresponding member

of  Xf  Alternatively,  log(7rf^/l  -  n^)  ^  exp(a;j/3/\/l  +  a^)  (derived  in  subsequent 
sections)  is  the  logit  of  marginal  probabilities  implied  by  the  conditional  model  and 
the  effect  of  parameters  are  seen  to  be  down-weighted  when  using  this  function  as 
the  quantity  of  interest.  Often,  this  is  the  preferred  measure,  and  we  will  derive 
it  in  the  next  section.  Further  discussion  of  the  interpretation  of  parameters  in 
GLMMs  with  time  dependent  random  effects  is  provided  in  Section  4.3.3. 

Let's turn now to the marginal distribution of $Y_t$, the sum of $n_t$ binary
variables $Y_{st}$, and their dependence structure over time. At time $t$, the binary variables
$Y_{st}$ have marginal mean $\pi_t^M = E[Y_{st}] = E[\pi_t(u_t)]$, variance $\mathrm{var}(Y_{st}) = \pi_t^M(1-\pi_t^M)$
and constant covariance $\mathrm{cov}(Y_{st},Y_{s't}) = \mathrm{var}(\pi_t(u_t))$, which is a function of $\sigma$. By
sharing a common random effect $u_t$ at time $t$, observations $\{Y_{st}\}_{s=1}^{n_t}$ are marginally
dependent, with an exchangeable correlation structure. Correlated random effects
$\{u_t\}$ induce a second, time related dependency: For binary observations $Y_{st}$ and
$Y_{s't^*}$ at two different time points $t$ and $t^*$, $\mathrm{cov}(Y_{st},Y_{s't^*}) = \mathrm{cov}(\pi_t(u_t),\pi_{t^*}(u_{t^*}))$,
which depends on the assumed covariance of random effects $u_t$ and $u_{t^*}$.

As a consequence of marginal dependence among the binary variables at a
common time point $t$ (within-type dependency), their sum
$Y_t = \sum_{s=1}^{n_t} Y_{st}$ shows overdispersion relative to a binomial random
variable. Its mean and variance are given by

\[
\begin{aligned}
E[Y_t] &= n_t \pi_t^M \\
\mathrm{var}(Y_t) &= n_t \pi_t^M (1 - \pi_t^M) + n_t(n_t - 1)\,\mathrm{var}(\pi_t(u_t)),
\end{aligned}
\]

which can be evaluated by using methods of the next section.
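
The overdispersion formula can be illustrated numerically. The following Python
sketch evaluates $\pi_t^M$ and $\mathrm{var}(\pi_t(u_t))$ by one-dimensional
numerical integration over a normal random effect and plugs them into the
expressions above. The function names, the midpoint rule and the parameter values
are illustrative assumptions, not part of the original analysis.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def normal_mean(f, sigma, n_grid=4000, width=8.0):
    """Approximate E[f(u)] for u ~ N(0, sigma^2) by a midpoint rule in z = u / sigma."""
    h = 2.0 * width / n_grid
    total = 0.0
    for i in range(n_grid):
        z = -width + (i + 0.5) * h
        total += f(sigma * z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)

def binomial_sum_moments(eta, sigma, n_t):
    """Return (E[Y_t], var(Y_t)) when n_t binary responses share one random effect.

    eta plays the role of the fixed part x_t' beta (an assumed scalar value).
    """
    pm = normal_mean(lambda u: expit(eta + u), sigma)               # pi_t^M
    var_pi = normal_mean(lambda u: expit(eta + u) ** 2, sigma) - pm ** 2
    mean = n_t * pm
    var = n_t * pm * (1.0 - pm) + n_t * (n_t - 1) * var_pi
    return mean, var

mean, var = binomial_sum_moments(eta=0.5, sigma=1.5, n_t=20)
binomial_var = mean * (1.0 - mean / 20)   # variance of a plain binomial with the same mean
print(round(mean, 3), round(var, 3), round(var / binomial_var, 2))
```

The last printed value is the ratio of the marginal variance to the binomial
variance; any value above 1 reflects the overdispersion induced by the shared
random effect.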

As a consequence of marginal dependency between binary variables at times $t$
and $t^*$ (between-type dependency), their respective sums $Y_t$ and $Y_{t^*}$ are
correlated according to

\[
\mathrm{corr}(Y_t, Y_{t^*}) = n_t n_{t^*}\,\mathrm{cov}(\pi_t(u_t), \pi_{t^*}(u_{t^*})) \,/\, [\mathrm{var}(Y_t)\,\mathrm{var}(Y_{t^*})]^{1/2},
\]

which again can be evaluated using methods of the next section.
4.3.2     Approximation  Techniques  for  Marginal  Moments 
4.3.2.1     Approximation  based  on  Taylor  series  expansions 

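To evaluate moments of the marginal distribution of $Y_t$, we will first use a
second-order Taylor series expansion of $\pi_t(u_t)$ around the mean $E[u_t] = 0$
of the random effect $u_t$. It is given by

\[
\pi_t(u_t) \approx \frac{1}{1+\exp\{-x_t'\beta\}}
 + \frac{\exp\{-x_t'\beta\}}{(1+\exp\{-x_t'\beta\})^2}\,u_t
 + \frac{\exp\{-2x_t'\beta\} - \exp\{-x_t'\beta\}}{2(1+\exp\{-x_t'\beta\})^3}\,u_t^2 .
\]

Using this expansion, we can approximate the marginal mean, variance and
covariance of the conditionally specified success probabilities:

\[
E[\pi_t(u_t)] \approx \frac{1}{1+\exp\{-x_t'\beta\}}
 \left[1 + \frac{\exp\{-2x_t'\beta\} - \exp\{-x_t'\beta\}}{2(1+\exp\{-x_t'\beta\})^2}\,\sigma^2\right]
 \tag{4.10}
\]

\[
\mathrm{var}(\pi_t(u_t)) \approx
 \left(\frac{\exp\{-x_t'\beta\}}{[1+\exp\{-x_t'\beta\}]^2}\right)^2
 \left[\sigma^2 + \left(\frac{1-\exp\{-x_t'\beta\}}{1+\exp\{-x_t'\beta\}}\right)^2 \frac{\sigma^4}{2}\right]
 \tag{4.11}
\]

\[
\mathrm{cov}(\pi_t(u_t), \pi_{t^*}(u_{t^*})) \approx
 \frac{\exp\{-(x_t + x_{t^*})'\beta\}}{(1+\exp\{-x_t'\beta\})^2 (1+\exp\{-x_{t^*}'\beta\})^2}
 \left[\mathrm{cov}(u_t, u_{t^*})
 + \frac{(1-\exp\{-x_t'\beta\})(1-\exp\{-x_{t^*}'\beta\})}{(1+\exp\{-x_t'\beta\})(1+\exp\{-x_{t^*}'\beta\})}
 \frac{\mathrm{cov}(u_t^2, u_{t^*}^2)}{4}\right]
 \tag{4.12}
\]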

For the last two expressions, we used the additional assumptions that
$E[u_t^3] = 0$ and $E[u_t^4] = 3\sigma^4$ for all $t$, which, for instance, hold for
the normal distribution. With independent random effects,
$\mathrm{cov}(u_t, u_{t^*}) = 0$ and the covariance between the success probabilities
is zero. Using correlated normal random effects,
$\mathrm{cov}(u_t^2, u_{t^*}^2) = 2\,\mathrm{cov}(u_t, u_{t^*})^2$, which is equal to
$2\sigma^4\rho^2$ for equally correlated random effects and equal to
$2\sigma^4\rho^{2\sum_{k=t}^{t^*-1} d_k}$ for autoregressive random effects. This
simplifies to $2\sigma^4\rho^{2h}$ for equally spaced observations $h$ units apart.
These results are derived by evaluating the joint moment generating function of
the bivariate normal distribution of $(u_t, u_{t^*})$.

Using the approximate expressions for the variance and covariance given in
(4.11) and (4.12), the correlation between two success probabilities at different
times $t$ and $t^*$ is approximated by

\[
\mathrm{corr}(\pi_t(u_t), \pi_{t^*}(u_{t^*})) \approx
\frac{\mathrm{cov}(u_t, u_{t^*})
 + \dfrac{(1-\exp\{-x_t'\beta\})(1-\exp\{-x_{t^*}'\beta\})}{(1+\exp\{-x_t'\beta\})(1+\exp\{-x_{t^*}'\beta\})}
 \dfrac{\mathrm{cov}(u_t^2, u_{t^*}^2)}{4}}
{\left[\sigma^2 + \left(\dfrac{1-\exp\{-x_t'\beta\}}{1+\exp\{-x_t'\beta\}}\right)^2 \dfrac{\sigma^4}{2}\right]^{1/2}
 \left[\sigma^2 + \left(\dfrac{1-\exp\{-x_{t^*}'\beta\}}{1+\exp\{-x_{t^*}'\beta\}}\right)^2 \dfrac{\sigma^4}{2}\right]^{1/2}} .
\]

Plugging in the expressions for $\mathrm{cov}(u_t, u_{t^*})$ and
$\mathrm{cov}(u_t^2, u_{t^*}^2)$ according to the random effects assumption yields
the approximation of the implied marginal correlation between success
probabilities at time $t$ and $t^*$.
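
As a hedged illustration, the Taylor-based correlation can be coded in a few
lines of Python. The sketch assumes normal random effects, for which
$\mathrm{cov}(u_t^2, u_{t^*}^2) = 2\,\mathrm{cov}(u_t, u_{t^*})^2$, and uses the
identity $(1-e^{-\eta})/(1+e^{-\eta}) = \tanh(\eta/2)$. The function name
`taylor_corr` and the parameter values are illustrative only.

```python
import math

def taylor_corr(eta_t, eta_s, sigma, r, second_order=True):
    """Taylor approximation of corr(pi_t(u_t), pi_s(u_s)).

    eta_t, eta_s : fixed parts of the linear predictor at the two time points
    sigma        : standard deviation of the normal random effects
    r            : corr(u_t, u_s), e.g. rho**h for AR(1) effects h units apart
    """
    k = 1.0 if second_order else 0.0           # switch for the second-order terms
    w_t = math.tanh(eta_t / 2.0)               # (1 - exp(-eta)) / (1 + exp(-eta))
    w_s = math.tanh(eta_s / 2.0)
    cov_u = sigma ** 2 * r
    cov_u2 = 2.0 * sigma ** 4 * r ** 2         # normal-theory cov(u_t^2, u_s^2)
    num = cov_u + k * w_t * w_s * cov_u2 / 4.0
    den = math.sqrt((sigma ** 2 + k * w_t ** 2 * sigma ** 4 / 2.0)
                    * (sigma ** 2 + k * w_s ** 2 * sigma ** 4 / 2.0))
    return num / den

rho, sigma = 0.8, 1.0
for h in (1, 2, 5):
    print(h, round(taylor_corr(0.5, -0.3, sigma, rho ** h), 4))
```

With `second_order=False` the function returns `r` itself, which is the
first-order result that the success probabilities inherit the correlation of the
latent process.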

Many authors (e.g., Zeger, Liang and Albert, 1988) only use a first-order
Taylor expansion, for which the second terms in the large square brackets of
(4.10)-(4.12) would vanish. Then, the approximation for the correlation between
the conditional success probabilities at time $t$ and $t^*$ simplifies to

\[
\mathrm{corr}(\pi_t(u_t), \pi_{t^*}(u_{t^*})) \approx \frac{\mathrm{cov}(u_t, u_{t^*})}{\sigma^2}
\]

and is $\rho$ for equally correlated random effects and
$\rho^{\sum_{k=t}^{t^*-1} d_k}$ for autoregressive random effects. In that case, the
success probabilities directly inherit the correlation properties from the
underlying latent random process.
4.3.2.2     Cumulative  Gaussian  approximation 

An alternative approximation for $E[\pi_t(u_t)]$ is given by Zeger, Liang and
Albert (1988), who use a cumulative Gaussian approximation to the logistic
function (Johnson and Kotz, 1970, p. 6) to derive

\[
E[\pi_t(u_t)] \approx \frac{1}{1 + \exp\{-x_t'\beta/\sqrt{1 + (c\sigma)^2}\}}, \tag{4.13}
\]

where $c$ is a constant equal to $16\sqrt{3}/(15\pi)$. They use this expression for
the marginal mean together with an approximation for the marginal covariance
matrix based on a first-order Taylor series expansion of $\pi_t(u_t)$ (outlined
above) to motivate a GEE approach of fitting GLMMs.

Approximations based on the Taylor series expansion are only accurate for
random effects close to their mean value of 0, as measured by small values of
$\sigma$. Figure 4-1 plots the approximations of the marginal probability
$\pi_t^M = E[\pi_t(u_t)]$ given in (4.10) over a range of (-4, 4) for the fixed
part $x_t'\beta$ of the linear predictor, for various values of $\sigma$. The
dotted curve in Figure 4-1 corresponds to $\sigma = 2.5$. It clearly shows that
the approximation is useless for such a value of $\sigma$, since the approximated
marginal probability loses the fundamental property of monotonicity in the
linear predictor. For even larger values of $\sigma$, formula (4.10) shows that
the approximation can even be greater than 1 or less than 0. This problem is
also not alleviated by including further terms in the Taylor series or expanding
the series around $\pm\sigma$, with the sign depending on whether one expects a
positive or negative value for $u_t$. Unfortunately, large values for $\sigma$ are
the rule rather than the exception in models with autoregressive random effects.
(See the remark by Diggle et al., 2002, p. 239, and the examples in Chapter 5.)
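
The breakdown of the Taylor approximation is easy to reproduce. The following
Python sketch evaluates (4.10), written as
$\pi + \tfrac{1}{2}\sigma^2\pi(1-\pi)(1-2\pi)$ with
$\pi = \mathrm{logit}^{-1}(x_t'\beta)$, on a grid and checks whether it is
non-decreasing. The function names and the grid are illustrative assumptions.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def taylor_marginal_mean(eta, sigma):
    """Second-order Taylor approximation (4.10) of E[pi_t(u_t)]."""
    p = expit(eta)
    return p + 0.5 * sigma ** 2 * p * (1.0 - p) * (1.0 - 2.0 * p)

def is_monotone(sigma, lo=-4.0, hi=4.0, n=801):
    """Check whether the approximation is non-decreasing in eta on [lo, hi]."""
    grid = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    vals = [taylor_marginal_mean(e, sigma) for e in grid]
    return all(b >= a for a, b in zip(vals, vals[1:]))

for sigma in (1.0, 1.5, 2.5):
    print(sigma, is_monotone(sigma))

# For a very large sigma the "probability" leaves the unit interval.
print(taylor_marginal_mean(math.log(0.25), 5.0))
```

The derivative of the approximation at $x_t'\beta = 0$ is proportional to
$1 - \sigma^2/4$, so monotonicity is lost as soon as $\sigma$ exceeds 2.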

On the other hand, (4.13) does not suffer these drawbacks. As $\sigma$ increases,
the conditional success probabilities have distributions increasingly concentrated
near 0 and near 1. Averaging over these conditional probabilities, we expect a
marginal probability of 0.5, which is the limit of (4.13) when
$\sigma^2 \to \infty$. However, approximations for the marginal variance and
covariance are not easy to derive with the cumulative Gaussian approximation of
the logistic function, but these are the quantities we are interested in for a
comparison of model based estimates and sample based estimates. The next section
mentions a connection of logit and probit models which can be used to calculate
all desired marginal properties for cases where $\sigma$ is large.
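
A minimal Python sketch of (4.13) is given below, assuming the constant
$c = 16\sqrt{3}/(15\pi)$ of Zeger, Liang and Albert (1988) and treating the fixed
part $x_t'\beta$ as a scalar `eta`. It compares the approximation with a
numerically integrated marginal mean; the helper names and the grid are
illustrative choices.

```python
import math

C = 16.0 * math.sqrt(3.0) / (15.0 * math.pi)   # about 0.588, roughly 1/1.7

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def normal_mean(f, sigma, n_grid=4000, width=8.0):
    """Approximate E[f(u)] for u ~ N(0, sigma^2) by a midpoint rule in z = u / sigma."""
    h = 2.0 * width / n_grid
    total = 0.0
    for i in range(n_grid):
        z = -width + (i + 0.5) * h
        total += f(sigma * z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)

def zla_marginal_mean(eta, sigma):
    """Cumulative Gaussian approximation (4.13) of E[pi_t(u_t)]."""
    return expit(eta / math.sqrt(1.0 + (C * sigma) ** 2))

sigma = 2.5
grid = [-4.0 + 0.5 * i for i in range(17)]
max_err = max(abs(zla_marginal_mean(e, sigma)
                  - normal_mean(lambda u: expit(e + u), sigma)) for e in grid)
print(round(max_err, 4))
print(zla_marginal_mean(2.0, 1e6))   # approaches 0.5 as sigma grows
```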
4.3.2.3     Marginal  results  through  probit  models 

In a longitudinal setting and for long sequences of binary data, Smith and
Diggle (1998) propose models with correlated random effects similar to those
presented here, but pursue a different fitting approach. They derive marginal
moments implied by their model and construct a marginal covariance matrix of
observations, all with the intention to use the GEE methodology for estimation.
Also, they use a probit link

\[
\pi_t(u_t) = P(Y_{st} = 1 \mid u_t) = \Phi(x_t'\tilde\beta + u_t) \tag{4.14}
\]

to model the conditional success probability $\pi_t(u_t)$ at time $t$, where
$\Phi$ is the standard normal cdf. To distinguish these from their logit model
counterparts, we use $\tilde\beta$ and $\tilde\sigma$ to denote the fixed effects
parameters and variance component in the probit model. Unlike the variance, the
correlation is a scale free measurement. Hence, the correlation among the
conditional success probabilities, as measured by $\rho$, is the same, whether
measured on the logit scale or the probit scale. Thus, no new parameter for
describing the correlation in the probit model is needed.

Figure 4-1: Approximated marginal probabilities for the fixed part predictor
value $x'\beta$ ranging from -4 to 4 in a logit model.
Approximations are based on a second-order Taylor series expansion (first panel),
the probit connection (second panel) and Monte Carlo integration over the random
effects distribution (third panel). The 4 lines in each panel correspond to
$\sigma = 1, 1.5, 2, 2.5$, with the dotted line corresponding to $\sigma = 2.5$.

The  advantage  of  a  probit  link  model  is  that  marginal  means  and  covariances 
can  be  calculated  explicitly  and  no  approximation  is  needed.  Smith  and  Diggle 
(1998)  employ  the  threshold  interpretation  to  derive  these  exact  results.  Using 
similar  arguments,  we  now  intend  to  derive  alternative  approximations  of  marginal 
moments  and  correlations  for  our  logit  link  GLMMs  with  correlated  random  effects. 
These  results  can  then  be  compared  to,  or  used  instead  of  the  approximate  results 
for  the  logit  link  derived  with  the  Taylor  expansion  given  above. 

The threshold (or latent variable) interpretation states that $Y_{st} = 1$ if and
only if $T_{st} < c$ for a suitable threshold value $c$ (here
$c = x_t'\tilde\beta + u_t$), where $\{T_{st}\}$ are independent $N(0, 1)$ latent
variables (and independent of any random effect in the linear predictor).
Then, under (4.14),

\[
E[Y_{st}] = P(Y_{st} = 1) = P(T_{st} - u_t < x_t'\tilde\beta)
 = \Phi\!\left(x_t'\tilde\beta/\sqrt{1 + \tilde\sigma^2}\right),
\]

since $T_{st} - u_t$ has a $N(0, 1 + \tilde\sigma^2)$ distribution.

The threshold interpretation, although mathematically convenient, may sometimes
seem artificial. We next give an exact proof of the above result, without
using the threshold interpretation. Similar proofs can be constructed for all
results derived in this section. Note that

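\[
E[Y_{st}] = E_{u_t}\big[E[Y_{st} \mid u_t]\big] = E_{u_t}[\pi_t(u_t)]
 = E_{u_t}[\Phi(x_t'\tilde\beta + u_t)].
\]

Let $g(u_t)$ denote the $N(0, \tilde\sigma^2)$ density of random effect $u_t$ and
let $\phi(z)$ be the standard normal density. Then

\[
\begin{aligned}
E_{u_t}[\Phi(x_t'\tilde\beta + u_t)]
 &= \int_{-\infty}^{\infty} \int_{-\infty}^{x_t'\tilde\beta + u_t} \phi(z)\,dz\; g(u_t)\,du_t \\
 &= \int_{-\infty}^{\infty} \int_{-\infty}^{x_t'\tilde\beta} \phi(z + u_t)\, g(u_t)\,dz\,du_t \\
 &= \int_{-\infty}^{x_t'\tilde\beta} \int_{-\infty}^{\infty} \frac{1}{2\pi\tilde\sigma}
    \exp\left\{-\left[z^2 + 2zu_t + (1 + 1/\tilde\sigma^2)u_t^2\right]/2\right\} du_t\,dz .
\end{aligned}
\]

The last integrand can be recognized as the joint density of a bivariate normal
random variable $(z, u_t)$ that is distributed according to

\[
N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},
 \begin{pmatrix} 1 + \tilde\sigma^2 & -\tilde\sigma^2 \\ -\tilde\sigma^2 & \tilde\sigma^2 \end{pmatrix} \right).
\]

Hence, the inner integral gives the marginal distribution of $z$, which is
$N(0, 1 + \tilde\sigma^2)$. Then,

\[
\begin{aligned}
E_{u_t}[\Phi(x_t'\tilde\beta + u_t)]
 &= \int_{-\infty}^{x_t'\tilde\beta} \frac{1}{\sqrt{1 + \tilde\sigma^2}}\,
    \phi\!\left(z/\sqrt{1 + \tilde\sigma^2}\right) dz \\
 &= \int_{-\infty}^{x_t'\tilde\beta/\sqrt{1 + \tilde\sigma^2}} \phi(z)\,dz
  = \Phi\!\left(x_t'\tilde\beta/\sqrt{1 + \tilde\sigma^2}\right),
\end{aligned}
\]

giving the above result.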
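
The identity $E_{u_t}[\Phi(x_t'\tilde\beta + u_t)] = \Phi(x_t'\tilde\beta/\sqrt{1+\tilde\sigma^2})$
can also be checked numerically. The Python sketch below compares the closed form
with a midpoint-rule integral over the random effects density; it relies only on
the standard library (`statistics.NormalDist`), and the function names and test
values are illustrative assumptions.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()   # standard normal: provides pdf() and cdf()

def probit_marginal_exact(eta, sigma):
    """Closed form Phi(eta / sqrt(1 + sigma^2)) for the marginal success probability."""
    return STD_NORMAL.cdf(eta / math.sqrt(1.0 + sigma ** 2))

def probit_marginal_numeric(eta, sigma, n_grid=4000, width=8.0):
    """Midpoint-rule approximation of E[Phi(eta + u)] for u ~ N(0, sigma^2)."""
    h = 2.0 * width / n_grid
    total = 0.0
    for i in range(n_grid):
        z = -width + (i + 0.5) * h          # z = u / sigma
        total += STD_NORMAL.cdf(eta + sigma * z) * STD_NORMAL.pdf(z)
    return total * h

for eta, sigma in [(-2.0, 0.5), (0.0, 1.5), (1.5, 3.0)]:
    print(eta, sigma,
          round(probit_marginal_exact(eta, sigma), 6),
          round(probit_marginal_numeric(eta, sigma), 6))
```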

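Using the threshold interpretation again, the marginal joint moment of two
binary variables $Y_{st}$ and $Y_{s't}$ observed at the same time $t$ is given by

\[
\begin{aligned}
E[Y_{st}Y_{s't}] &= P(Y_{st}Y_{s't} = 1) \\
 &= P(T_{st} - u_t < x_t'\tilde\beta,\; T_{s't} - u_t < x_t'\tilde\beta) \\
 &= \Phi_2\big((x_t'\tilde\beta, x_t'\tilde\beta)', Q(t,t)\big),
\end{aligned}
\tag{4.15}
\]

where

\[
Q(t,t^*) = \begin{pmatrix} 1 + \tilde\sigma^2 & \mathrm{cov}(u_t, u_{t^*}) \\
 \mathrm{cov}(u_t, u_{t^*}) & 1 + \tilde\sigma^2 \end{pmatrix}
\]

and $\Phi_2((a, b)', Q(t,t^*))$ is the probability that a bivariate zero-mean
normal random variable with covariance matrix $Q(t,t^*)$ is less than $(a, b)'$.
Summing up, the marginal properties of the binary variables
$\{Y_{st}\}_{s=1}^{n_t}$ at time $t$ are explicitly given by

\[
\begin{aligned}
E[Y_{st}] &= \pi_t^M = \Phi\!\left(x_t'\tilde\beta/\sqrt{1 + \tilde\sigma^2}\right) \\
\mathrm{var}(Y_{st}) &= \pi_t^M(1 - \pi_t^M) \\
\mathrm{cov}(Y_{st}, Y_{s't}) &= \Phi_2\big((x_t'\tilde\beta, x_t'\tilde\beta)', Q(t,t)\big) - (\pi_t^M)^2 .
\end{aligned}
\tag{4.16}
\]

For observations $Y_{st}$ and $Y_{s't^*}$ at two different time points,

\[
\begin{aligned}
E[Y_{st}Y_{s't^*}] &= P(Y_{st}Y_{s't^*} = 1) \\
 &= P(T_{st} - u_t < x_t'\tilde\beta,\; T_{s't^*} - u_{t^*} < x_{t^*}'\tilde\beta) \\
 &= \Phi_2\big((x_t'\tilde\beta, x_{t^*}'\tilde\beta)', Q(t,t^*)\big).
\end{aligned}
\]

This leads to a marginal covariance of

\[
\mathrm{cov}(Y_{st}, Y_{s't^*}) = \Phi_2\big((x_t'\tilde\beta, x_{t^*}'\tilde\beta)', Q(t,t^*)\big) - \pi_t^M \pi_{t^*}^M
\]

between two binary observations at time points $t$ and $t^*$. Plugging in
different forms for the covariance of the random effects in $Q(t,t^*)$ results in
different covariance structures among the two binary observations.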

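Now, let $Y_t = \sum_{s=1}^{n_t} Y_{st}$ represent the sum over all $n_t$
(marginally dependent) binary variables at time $t$. Then, similar to the logit
link results presented before, $Y_t$ is overdispersed relative to a binomial with
mean

\[
E[Y_t] = n_t \pi_t^M
\]

and variance

\[
\mathrm{var}(Y_t) = n_t \pi_t^M(1 - \pi_t^M)
 + n_t(n_t - 1)\left[\Phi_2\big((x_t'\tilde\beta, x_t'\tilde\beta)', Q(t,t)\big) - (\pi_t^M)^2\right].
\]

The correlation between random effects implies a correlation between outcomes at
different time points $t$ and $t^*$, resulting in a correlation between $Y_t$ and
$Y_{t^*}$ of form

\[
\begin{aligned}
\mathrm{corr}(Y_t, Y_{t^*}) &= \mathrm{cov}(Y_t, Y_{t^*}) / [\mathrm{var}(Y_t)\,\mathrm{var}(Y_{t^*})]^{1/2} \\
 &= n_t n_{t^*}\left[\Phi_2\big((x_t'\tilde\beta, x_{t^*}'\tilde\beta)', Q(t,t^*)\big) - \pi_t^M \pi_{t^*}^M\right]
    / [\mathrm{var}(Y_t)\,\mathrm{var}(Y_{t^*})]^{1/2} .
\end{aligned}
\tag{4.17}
\]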
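
To make (4.15)-(4.17) concrete, the following Python sketch evaluates $\Phi_2$ by
one-dimensional numerical integration (conditioning on the first coordinate) and
then computes the implied correlation between $Y_t$ and $Y_{t^*}$. It is a sketch
under stated assumptions: scalar fixed parts `eta_t` and `eta_s` stand in for
$x_t'\tilde\beta$ and $x_{t^*}'\tilde\beta$, `r_u` is the assumed correlation of
the random effects (for example $\rho^h$ under AR(1)), and all function names are
hypothetical.

```python
import math
from statistics import NormalDist

ND = NormalDist()   # standard normal pdf and cdf

def bvn_cdf(a, b, r, n_grid=4000, lo=-8.0):
    """P(Z1 < a, Z2 < b) for a standard bivariate normal with correlation |r| < 1."""
    if a <= lo:
        return 0.0
    h = (a - lo) / n_grid
    s = math.sqrt(1.0 - r * r)
    total = 0.0
    for i in range(n_grid):
        z = lo + (i + 0.5) * h
        total += ND.pdf(z) * ND.cdf((b - r * z) / s)   # Z2 | Z1 = z is N(r z, 1 - r^2)
    return total * h

def marginal_corr(eta_t, eta_s, n_t, n_s, sigma, r_u):
    """corr(Y_t, Y_s) as in (4.17) for a probit GLMM with random effect sd sigma."""
    v = 1.0 + sigma ** 2                       # diagonal element of Q
    a, b = eta_t / math.sqrt(v), eta_s / math.sqrt(v)
    pm_t, pm_s = ND.cdf(a), ND.cdf(b)          # marginal success probabilities
    r_same = sigma ** 2 / v                    # standardized off-diagonal of Q(t, t)
    r_diff = r_u * sigma ** 2 / v              # standardized off-diagonal of Q(t, t*)

    def var_sum(n, p, c):
        return n * p * (1.0 - p) + n * (n - 1) * (bvn_cdf(c, c, r_same) - p * p)

    cov = n_t * n_s * (bvn_cdf(a, b, r_diff) - pm_t * pm_s)
    return cov / math.sqrt(var_sum(n_t, pm_t, a) * var_sum(n_s, pm_s, b))

rho = 0.8
for h in (1, 2, 5):
    print(h, round(marginal_corr(0.3, -0.2, 10, 15, 1.2, rho ** h), 4))
```

Standardizing by $\sqrt{1+\tilde\sigma^2}$ turns $Q(t,t^*)$ into a correlation
matrix, which is why only a single correlation argument is needed in `bvn_cdf`.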


4.3.2.4     Logit-probit  connection 

In regular GLMs, parameter estimates in logit models are roughly 1.6 times
those in probit models (Agresti, 2002, p. 246). This number derives from the fact
that with a linear predictor of form $\alpha + \beta x$, the rate of change in the
success probability $\pi(x)$ is highest at $x = -\alpha/\beta$ for both the probit
model (i.e., $\pi(x)$ has the form of a normal cdf) and the logit model (i.e.,
$\pi(x)$ has the form of a logistic cdf). At this point, $\pi(x)$ is equal to 1/2
for both models, and the rate of change, $\partial\pi(x)/\partial x$, is equal to
$0.4\beta$ for the probit model and $0.25\beta$ for the logit model. Hence, we get
an equal rate of change (at $x = -\alpha/\beta$) in both models when the logit
$\beta$ is 0.4/0.25 = 1.6 times the probit $\beta$. When comparing the standard
deviations implied by the normal cdf (which is equal to $1/|\beta|$) and the
logistic cdf (which is equal to $\pi/(\sqrt{3}|\beta|)$), then this factor between
the relationships of parameters in the two models increases to 1.8.
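
The two scaling factors follow from elementary constants, as the short Python
sketch below shows. It also reports the largest pointwise gap between the
logistic cdf and the normal cdf rescaled by 1.6 on an illustrative grid; the
variable names and the grid are assumptions made for this example.

```python
import math
from statistics import NormalDist

ND = NormalDist()

# Slopes of the two cdfs at their centre, for unit beta.
probit_slope = ND.pdf(0.0)          # 1 / sqrt(2 pi), about 0.4
logit_slope = 0.25                  # derivative of the logistic cdf at 0

slope_factor = probit_slope / logit_slope      # matches rates of change
sd_factor = math.pi / math.sqrt(3.0)           # matches standard deviations

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

grid = [-6.0 + 0.01 * i for i in range(1201)]
max_gap = max(abs(expit(x) - ND.cdf(x / 1.6)) for x in grid)

print(round(slope_factor, 3), round(sd_factor, 3), round(max_gap, 4))
```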

We exploited this connection between parameters in logit and probit models
to construct Figure 4-2, where we compare conditional success probabilities based
on the logit link (4.9) and the probit link (4.14) for various values for $\sigma$
in a GLMM. We rewrote the linear predictor for the logit GLMM as
$\eta_t = x_t'\beta + \sigma z_t$, where $z_t$ is a standard normal variable and
$\sigma$ can be interpreted as the regression coefficient for the random effect
$z_t$. Then we used the approximate connection
$\eta_t^{\mathrm{probit}} \approx \eta_t^{\mathrm{logit}}/1.6$ between parameter
estimates to compute conditional success probabilities under both models. This
was done for a random sample of 6 $z_t$'s from a standard normal distribution. For
each generated $z_t$, each panel in Figure 4-2 displays the conditional success
probabilities based on the logit link (solid line) and (scaled) probit link
(dashed line), for $\eta_t^{\mathrm{logit}}$ ranging from -3 to 3. The 4 different
panels refer to 4 different choices of $\sigma$.

Figure 4-2: Comparison of conditional logit and probit model based probabilities.
Conditional success probabilities for logit (solid line) and probit (dashed
line) link GLMMs for linear predictor values ranging from -3 to 3. Each pair of
(solid, dashed) curves in each panel corresponds to one out of 6 randomly sampled
random effects $z_t$. Conditional success probabilities for probit link GLMMs
use a scaled version of the linear predictor for logit link GLMMs to adjust for
different parameter estimates in these two models. The four panels correspond to
four different values of the random effects standard deviation,
$\sigma = 1, 1.5, 2$ and $2.5$.

We clearly see that conditional success probabilities based on the logit and
(scaled) probit link are almost indistinguishable, irrespective of the magnitude
of $\sigma$. The agreement is best at conditional success probabilities around
1/2, which is to be expected based on the derivations given above. But even for
small and large success probabilities, the agreement is very good. Hence, with
the proper scaling factor on fixed effects parameters and the standard deviation
of the random effects, a probit-link GLMM corresponds to a logit link GLMM. That
is, for any fixed $z_t$,

\[
\pi_t(z_t, \beta, \sigma) = \mathrm{logit}^{-1}(x_t'\beta + \sigma z_t)
 \approx \Phi(x_t'\tilde\beta + \tilde\sigma z_t) = \pi_t(z_t, \tilde\beta, \tilde\sigma),
\]

where the left hand side in the approximation refers to a logit model for the
conditional success probabilities with parameters $\beta$ and $\sigma$, and the
right hand side refers to a probit model for the same conditional success
probabilities with parameters $\tilde\beta = \beta/1.6$ and
$\tilde\sigma = \sigma/1.6$. Taking expectations with respect to the distribution
of $z_t$, one would then also expect that

\[
E[\pi_t(z_t, \beta, \sigma)] \approx E[\pi_t(z_t, \tilde\beta, \tilde\sigma)]
\]

and consequently

\[
\pi_t^M(\beta, \sigma) \approx \pi_t^M(\tilde\beta, \tilde\sigma)
 = \Phi\!\left(x_t'\tilde\beta/\sqrt{1 + \tilde\sigma^2}\right). \tag{4.18}
\]

This gives an approximation of the marginal success probability of a logit model
in terms of parameters from a conditional probit model. Graphically, (4.18) means
that the average of conditionally specified logistic cdfs, which does not follow a
logistic form itself, can be approximated by the average of subject specific normal
cdfs, which does follow a normal form, provided the correct parameter adjustments
are made. The connection is pictured in Figure 4-3, where the average of 100
conditionally specified logistic curves
$\frac{1}{100}\sum_{i=1}^{100} \pi(z_i, \beta, \sigma)$ with $z_i \sim N(0, 2.5)$ is
compared to the cdf $\pi^M = \Phi(x'\tilde\beta/\sqrt{1 + \tilde\sigma^2})$ of a
marginal probit model with adjusted parameters. Note that the agreement is almost
perfect although $\sigma$ is large.
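
The quality of (4.18) can be examined with a small Python sketch. It integrates
the logit-link success probability numerically over a normal random effect and
compares the result with the probit-based closed form using
$\tilde\beta = \beta/1.6$ and $\tilde\sigma = \sigma/1.6$. The sketch assumes
that 2.5 is the random effects standard deviation and treats $x_t'\beta$ as a
scalar `eta`; the function names are illustrative.

```python
import math
from statistics import NormalDist

ND = NormalDist()
SCALE = 1.6   # approximate ratio of logit to probit parameters

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def logit_marginal_numeric(eta, sigma, n_grid=4000, width=8.0):
    """Midpoint-rule approximation of E[expit(eta + u)] for u ~ N(0, sigma^2)."""
    h = 2.0 * width / n_grid
    total = 0.0
    for i in range(n_grid):
        z = -width + (i + 0.5) * h
        total += expit(eta + sigma * z) * ND.pdf(z)
    return total * h

def logit_marginal_via_probit(eta, sigma):
    """Approximation (4.18): Phi(eta~ / sqrt(1 + sigma~^2)) with rescaled parameters."""
    eta_p, sigma_p = eta / SCALE, sigma / SCALE
    return ND.cdf(eta_p / math.sqrt(1.0 + sigma_p ** 2))

sigma = 2.5
grid = [-6.0 + 0.5 * i for i in range(25)]
max_diff = max(abs(logit_marginal_numeric(e, sigma) - logit_marginal_via_probit(e, sigma))
               for e in grid)
print(round(max_diff, 4))
```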

The upshot of this exercise is that we can use the exact formulae for marginal
properties of probit-link GLMMs to make good approximate marginal statements
in logit link GLMMs. These should be more accurate than the Taylor approximations
based on the logit link in cases where $\sigma$ is large. For instance, we can use
(4.16) to derive the marginal success probability or the marginal odds in logit
link GLMMs, and (4.17) to derive the marginal correlation between two
observations in models where the estimate of $\sigma$ is large.

Figure 4-3: Comparison of implied marginal probabilities from logit and probit
models.
The plot shows the average of 100 conditionally specified logistic curves (dashed
line) generated by using $u \sim N(0, 2.5)$ and the marginal normal curve
$\pi^M$ (solid line) from a probit model with adjusted parameters. The plot is
over a linear predictor range from -6 to 6. A random sample of 10 of the 100
generated conditionally specified logistic curves is also shown (grey dashed
lines).

The second panel of Figure 4-1 shows that the approximation of the marginal
success probabilities based on the probit connection does not suffer the drawbacks
(loss of monotonicity, non-convergence to 0.5 for $\sigma \to \infty$) experienced
with the Taylor based approximation approach. Also, notice the close connection
between the multiplicative factor $c = 16\sqrt{3}/(15\pi) \approx 1/1.7$ for
$\sigma$ in the cumulative Gaussian approximation of the marginal mean employed by
Zeger, Liang and Albert (1988) and the approximation using the probit link:

\[
\pi_t^M \approx \frac{1}{1 + \exp(-x_t'\beta/\sqrt{1 + c^2\sigma^2})},
 \qquad c^2 \approx (1/1.7)^2 \qquad \text{(Zeger et al.)}
\]

and

\[
\pi_t^M \approx \Phi\!\left(x_t'\tilde\beta/\sqrt{1 + \tilde\sigma^2}\right),
 \qquad \tilde\beta = \beta/1.6,\; \tilde\sigma^2 = (\sigma/1.6)^2 \qquad \text{(probit-logit)} \tag{4.19}
\]

The first approximation makes a stronger statement in that it says that the
marginal mean also has the form of a logistic regression model, with parameters
downweighted by a factor of $\sqrt{1 + c^2\sigma^2}$. However, using the
relationship between probit and logit link once more, the parameter vector
$\tilde\beta/\sqrt{1 + \tilde\sigma^2}$ for the marginal probit model (4.19)
translates to roughly
$1.6 \times \tilde\beta/\sqrt{1 + \tilde\sigma^2} = \beta/\sqrt{1 + \tilde\sigma^2}$
for a marginal logit model and

\[
\Phi\!\left(x_t'\tilde\beta/\sqrt{1 + \tilde\sigma^2}\right)
 \approx \frac{1}{1 + \exp(-x_t'\beta/\sqrt{1 + \tilde\sigma^2})} .
\]

Hence, both approximations show that the marginal mean follows roughly a logit
model. They differ only by the weight factor assigned to the random effects
standard deviation. Notice that the probit-logit connection and derivations
outlined in this dissertation are more valuable because they also provide
approximations to the marginal variance and correlation, a key component for time
series observations.

In summary, fitting the logit GLMM allows for the usual interpretation of
parameters as (conditional) effects on the log odds. Through exploiting
connections with models using a probit link we can give good closed form,
analytical approximations for marginal probabilities, odds and correlations. Note
that non-closed form solutions to the marginal mean, variance and correlation can
always be obtained by integrating over the random effects density, e.g.,

\[
\pi_t^M = \int \pi_t(u_t)\, g(u_t)\, du_t ,
\]

and approximated by stochastic methods such as a Monte Carlo sum using the
fitted random effects distribution. The third panel of Figure 4-1 displays these
Monte Carlo averages (based on 100,000 draws from the assumed $N(0, \sigma)$
random effects distribution) and shows almost perfect agreement with the
closed-form approximations based on the probit model (see also Figure 4-3). For
this simple example, the Monte Carlo averages are easy to obtain, but considerably
more simulation effort may be necessary to yield good approximations for higher
dimensional marginal probabilities, such as the occurrence of three consecutive
successes, or more complicated functions. The examples in Section 5.4 will
illustrate this point further and make extensive use of both of these
approximation techniques. There, the main use of these approximations will be to
compare the empirical dependency structure observed in the time series to the
theoretical one implied by the model and to compare observed frequencies in the
time series to estimated ones based on our proposed models.
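
As a hedged illustration of the Monte Carlo route, the Python sketch below
estimates the marginal probability of three consecutive successes. It assumes a
single binary observation per time point, a constant linear predictor `eta`, and
stationary AR(1) normal random effects with standard deviation `sigma` and lag-one
correlation `rho`; given the random effects, the three responses are independent,
so the conditional probability is a product. The function name, seed and
parameter values are illustrative only.

```python
import math
import random

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def mc_three_successes(eta, sigma, rho, n_draws=100_000, seed=1):
    """Monte Carlo estimate of P(Y_t = Y_{t+1} = Y_{t+2} = 1) in a logit GLMM."""
    rng = random.Random(seed)
    innov_sd = sigma * math.sqrt(1.0 - rho ** 2)   # keeps var(u_t) = sigma^2
    total = 0.0
    for _ in range(n_draws):
        u1 = rng.gauss(0.0, sigma)
        u2 = rho * u1 + rng.gauss(0.0, innov_sd)
        u3 = rho * u2 + rng.gauss(0.0, innov_sd)
        # conditional on (u1, u2, u3) the three responses are independent
        total += expit(eta + u1) * expit(eta + u2) * expit(eta + u3)
    return total / n_draws

p_indep = mc_three_successes(0.0, 2.0, 0.0, n_draws=20_000)
p_ar1 = mc_three_successes(0.0, 2.0, 0.8, n_draws=20_000)
print(round(p_indep, 3), round(p_ar1, 3))
```

With independent random effects the estimate should be close to the cube of the
marginal success probability, whereas positive autocorrelation makes runs of
successes more likely.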
4.3.3     Parameter  Interpretation 

In GLMMs for binary data we model conditional log odds given random
effects. By averaging with respect to the random effects distribution on the
logit scale we obtain unconditional (or expected) log odds. However, these are
different from the marginal log odds $\log(\pi_t^M/(1 - \pi_t^M))$ obtained with
the marginal probabilities implied by the conditionally formulated model. In the
previous section we derived the approximation

\[
\log\big(\pi_t^M/(1 - \pi_t^M)\big) \approx x_t'\beta/\sqrt{1 + \tilde\sigma^2}
\]

and we see that parameters can still be interpreted as log odds ratios, but
down-weighted by a factor of $\sqrt{1 + \tilde\sigma^2}$. In the literature on
longitudinal data, where random effects pertain to subjects, this interpretation
is preferred when the dependency structure is considered a nuisance. However, for
interpreting regression parameters in a time series analysis based on the GLMMs
outlined in this dissertation, a preference is not so clear. For the related
analysis of binary time series via hidden Markov models, where the distinction to
GLMMs is essentially the assumption of a discrete Markov process on a few states
instead of an AR(1) process for the latent random process $\{u_t\}$, MacDonald and
Zucchini (1997) give unconditional interpretations of regression parameters
throughout. Since the previous section focused on marginal interpretations, this
section focuses on unconditional ones, but it is noted that it may be more
natural to interpret a log odds of average probabilities (i.e., think
marginally), than an average log odds. (To illustrate the issue further, in
pre-GLM times people took logarithmic transforms of the observations and fitted
a linear model to their means. But then one is modeling the mean of the logarithm
rather than the logarithm of the mean, as would be done with a GLM.)
4.3.3.1     Conditional  and  unconditional  log  odds  and  log  odds  ratios 
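The conditional log odds of success at time point $t$ are given by

\[
\mathrm{logit}(\pi_t(u_t)) = x_t'\beta + u_t .
\]

Integrating over the random effects distribution, the unconditional or expected
log odds are equal to $x_t'\beta$, and $\beta$ can be interpreted as the
unconditional or expected change in the log odds for a change in the covariates.
Since $u_t$ is $N(0, \sigma^2)$, the central $100(1 - \alpha)\%$ of the
distribution of the log odds falls in between

\[
x_t'\beta \pm z_{\alpha/2}\,\sigma ,
\]

where $z_{\alpha/2}$ denotes the upper $\alpha/2$ quantile of the standard normal
distribution.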

The conditional log odds ratio of success at time point $t^*$ over one at time
point $t$ is given by

\[
\mathrm{logit}(\pi_{t^*}(u_{t^*})) - \mathrm{logit}(\pi_t(u_t)) = (x_{t^*} - x_t)'\beta + (u_{t^*} - u_t).
\]

With correlated random effects, the interpretation of the log odds ratio is time
specific not only through the covariates but also through the random term
$(u_{t^*} - u_t)$. This is different from the so called "subject-specific"
interpretation of the log odds ratio in a regular random intercepts model
($u_t = u$ for all $t$). There, the random effect is assumed to be constant over
time and cancels out in the log odds ratio, and $\beta$ can be directly
interpreted as the change in the conditional log odds for a change in the
covariates. With correlated random effects, $\beta$ is the change in the
conditional log odds for a change in the covariates when random effects at time
$t^*$ and $t$ have the same value.

The unconditional or expected log odds ratio is equal to $(x_{t^*} - x_t)'\beta$
and $\beta$ can alternatively be interpreted as the expected change in the log
odds for a change in the covariates between time $t$ and $t^*$. The central
$100(1 - \alpha)\%$ of the distribution of the log odds ratio falls in between

\[
(x_{t^*} - x_t)'\beta \pm z_{\alpha/2}\sqrt{2\,(\sigma^2 - \mathrm{cov}(u_{t^*}, u_t))},
\]

which can be estimated by plugging in ML estimates for $\beta$ and the random
effects variance and covariances.
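
The interval for the log odds ratio is simple to compute once estimates are
available. The Python sketch below is a minimal illustration: `delta_eta` stands
for $(x_{t^*} - x_t)'\beta$, `corr_u` for the correlation of the two random
effects (so that $\mathrm{cov}(u_{t^*}, u_t) = \sigma^2 \cdot$ `corr_u`), and all
numerical inputs are made-up values rather than estimates from this dissertation.

```python
import math
from statistics import NormalDist

def log_odds_ratio_interval(delta_eta, sigma, corr_u, alpha=0.05):
    """Central 100(1 - alpha)% interval of the conditional log odds ratio.

    The random part u_{t*} - u_t is normal with variance 2 sigma^2 (1 - corr_u).
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)          # upper alpha/2 quantile
    sd_diff = math.sqrt(2.0 * sigma ** 2 * (1.0 - corr_u))
    return delta_eta - z * sd_diff, delta_eta + z * sd_diff

# Hypothetical inputs: AR(1) effects with rho = 0.8, observations 3 units apart.
lower, upper = log_odds_ratio_interval(delta_eta=0.4, sigma=1.0, corr_u=0.8 ** 3)
print(round(lower, 3), round(upper, 3))
```

For a random intercepts model (`corr_u = 1`) the interval collapses to the single
value `delta_eta`, in line with the subject-specific interpretation.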



4.3.3.2     Conditional  and  unconditional  odds  and  odds  ratios 

At time point $t$ with associated random effect $u_t$, we already defined the
unconditional or expected odds of success as

\[
E[\exp\{x_t'\beta + u_t\}] = \exp\{x_t'\beta + \sigma^2/2\} .
\]

Here, $\beta$ describes the effect of covariates on the expected odds of success,
with the intercept term offset as in Poisson GLMMs.

For two time points $t^*$ and $t$ with associated random effects $u_{t^*}$ and
$u_t$, the ratio of expected odds is equal to $\exp\{(x_{t^*} - x_t)'\beta\}$, and
can be interpreted as any regular odds ratio. I.e., for a positive, one unit
change in the $k$-th predictor from time $t$ to time $t^*$, the expected odds of
success at time $t^*$ are $\exp\{\beta_k\}$ times those at time $t$.

Due to the non-linearity of the odds, the ratio of expected odds is different
from the expected odds ratio, which is given by

\[
E\left[\frac{\exp\{x_{t^*}'\beta + u_{t^*}\}}{\exp\{x_t'\beta + u_t\}}\right]
 = \exp\{(x_{t^*} - x_t)'\beta\}\,\exp\{\sigma^2(1 - \mathrm{corr}(u_{t^*}, u_t))\} .
\]

Using this measure, $\exp\{\beta_k\}\exp\{\sigma^2(1 - \mathrm{corr}(u_{t^*}, u_t))\}$
now equals the change in the expected odds ratio of success at time $t^*$ versus
time $t$, for a positive, one unit change in the $k$-th predictor for that time
span. The first measure, $\exp\{\beta_k\}$, describes the change in the expected
odds at two time points; the second one,
$\exp\{\beta_k\}\exp\{\sigma^2(1 - \mathrm{corr}(u_{t^*}, u_t))\}$, describes the
expected change in the odds of a success at the two time points. In the
following, we focus on the ratio of expected odds.
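
The difference between the two measures can be made explicit in a few lines of
Python. The sketch below codes both closed forms for normal random effects and
verifies the factor $\exp\{\sigma^2(1 - \mathrm{corr}(u_{t^*}, u_t))\}$ by
numerically integrating $E[\exp(u_{t^*} - u_t)]$. Scalar linear predictors and
the function names are assumptions made for illustration.

```python
import math

def normal_mean(f, sigma, n_grid=4000, width=8.0):
    """Approximate E[f(d)] for d ~ N(0, sigma^2) by a midpoint rule in z = d / sigma."""
    h = 2.0 * width / n_grid
    total = 0.0
    for i in range(n_grid):
        z = -width + (i + 0.5) * h
        total += f(sigma * z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)

def expected_odds(eta, sigma):
    """E[exp(eta + u)] for u ~ N(0, sigma^2)."""
    return math.exp(eta + sigma ** 2 / 2.0)

def ratio_of_expected_odds(eta_s, eta_t, sigma):
    """The sigma^2 / 2 offsets cancel, leaving exp(eta_s - eta_t)."""
    return expected_odds(eta_s, sigma) / expected_odds(eta_t, sigma)

def expected_odds_ratio(eta_s, eta_t, sigma, corr_u):
    """E[exp(eta_s + u_s) / exp(eta_t + u_t)] for bivariate normal (u_s, u_t)."""
    return math.exp(eta_s - eta_t + sigma ** 2 * (1.0 - corr_u))

sigma, corr_u = 1.0, 0.5
sd_diff = math.sqrt(2.0 * sigma ** 2 * (1.0 - corr_u))   # sd of u_s - u_t
numeric_factor = normal_mean(math.exp, sd_diff)
print(round(ratio_of_expected_odds(0.7, 0.2, sigma), 4),
      round(expected_odds_ratio(0.7, 0.2, sigma, corr_u), 4),
      round(numeric_factor, 4))
```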

4.3.3.3     Multiple  time  series 

If $n$ different time series $y_1 = (y_{11}, \ldots, y_{1T})$, ...,
$y_i = (y_{i1}, \ldots, y_{iT})$, ..., $y_n = (y_{n1}, \ldots, y_{nT})$ are
observed and the same latent process $\{u_t\}_{t=1}^{T}$ is assumed to underlie
each one of them, the conditional log odds at time $t$ have form

\[
\mathrm{logit}(\pi_{it}(u_t)) = x_{it}'\beta + u_t, \qquad i = 1, \ldots, n.
\]

Here $\pi_{it}(u_t)$ is the conditional probability of success at time $t$ for the
$i$-th series and depends on time specific covariates $x_{it}$ plus a serially
correlated random time effect $u_t$. If two different time series $y_i$ and $y_j$
represent different subpopulations or stratifications of a population, interest
can focus on each one of the following three contrasts:

•  Contrasts between subpopulations at a given common observation time

•  Contrasts between different time points within the same subpopulation

•  Contrasts between subpopulations at different observation times

We will look at ratios of expected odds, which is perhaps the most natural metric,
to address these three points.

In the first case, the expected odds of success in stratum $i$ over the ones in
stratum $j$, at a fixed time $t$, are given by

\[
\exp\{x_{it}'\beta + \sigma^2/2\} / \exp\{x_{jt}'\beta + \sigma^2/2\} = \exp\{(x_{it} - x_{jt})'\beta\} .
\]

Then, $\exp\{\beta\}$ has the interpretation of a change in the expected odds for
a change in the strata covariates at fixed time $t$. For example, with model
(3.1),

\[
\mathrm{logit}(\pi_{it}(u_t)) = \mathrm{logit}(P(Y_{it} = y_{it} \mid u_t))
 = \alpha + \beta_1 x_{1t} + \beta_2 x_{1t}^2 + \beta_3 x_{2i} + \beta_4 x_{1t} x_{2i} + u_t
\]

for the two time series measuring attitude towards homosexual relationships for
whites and blacks, $\exp\{\beta_3 + \beta_4 x_{1t}\}$ describes the change in the
expected odds of approval of homosexual relationships for blacks versus whites in
year $x_{1t}$. That is, in year $x_{1t}$ the expected odds of approval for black
respondents are $\exp\{\beta_3 + \beta_4 x_{1t}\}$ times the expected odds of
approval for white respondents. Using the maximum likelihood estimates and their
estimated asymptotic standard errors (see Table 5-1) and covariances, the
expected odds of approval of homosexual relationships for black respondents in
1988 are estimated to be 0.65 times (95% confidence interval: [0.54, 0.73], using
the Delta method) the expected odds for white respondents in that year. Ten years
later, in 1998, this factor decreases to 0.49 with a 95% confidence interval of
0.37 to 0.61.

For the scenario in the second contrast, the ratio of expected odds at time $t^*$
versus time $t$ for subpopulation $i$ is given by

$$\exp\{(x_{it^*} - x_{it})'\beta\}.$$

Now, $\exp\{\beta\}$ describes the change in the expected odds for changes in the
predictors from time $t$ to time $t^*$. For the motivating example with model (3.1),
$\exp\{\beta_1 h + \beta_2 (x_{1,t+h}^2 - x_{1t}^2)\}$ is the change in the expected odds of approval for white
respondents and $\exp\{(\beta_1 + \beta_4) h + \beta_2 (x_{1,t+h}^2 - x_{1t}^2)\}$ is the change in the expected odds
of approval for black respondents, for observations $h$ years apart. For example, over
a period of 10 years from 1988 to 1998, the expected odds of approval increase by
a factor of 2.63 for white respondents (95% confidence interval: [1.56, 3.69], using
the delta method) and by a factor of 2.01 (95% confidence interval: [1.44, 2.58])
for black respondents.

For the third contrast,

$$\exp\{(x_{it^*} - x_{jt})'\beta\}$$

describes how the expected odds at time $t^*$ in stratum $i$ compare to the expected
odds at time $t$ in stratum $j$. Again, $\exp\{\beta\}$ describes the effect of a change in
the strata covariates from time $t$ to $t^*$ on the expected odds. In the motivating
example, the expected odds of approval for black respondents in year $x_{1t} + h$ are
$\exp\{\beta_1 h + \beta_2[(x_{1t} + h)^2 - x_{1t}^2] + \beta_3 + \beta_4 (x_{1t} + h)\}$ times those for white respondents in
year $x_{1t}$. Using the maximum likelihood estimates, the expected odds of approval
for black respondents in 1998 are estimated to be 0.99 times (i.e., almost equal to)
the expected odds of approval for white respondents 6 years earlier, in 1992. Here,
I fixed the year 1998 (i.e., $x_{1t} + h = 15$) and searched for the number of years $h$
one has to go back to match the expected odds of approval for the two races. That is, I
solved for $h$ in the equation

$$\hat\beta_1 h + \hat\beta_2 [15^2 - (15 - h)^2] + \hat\beta_3 + 15 \hat\beta_4 = 0,$$

yielding $h \approx 6$.
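
The search for $h$ can be retraced numerically. The following is only a sketch: it plugs in the rounded logistic GLMM estimates from Table 5-1 (the original calculation presumably used unrounded values), and the variable names are ours. Because the equation is quadratic in $h$, it has two roots, and we take the smaller one.

```python
import math

# Rounded logistic-GLMM estimates from Table 5-1 (assumed inputs).
b1, b2, b3, b4 = 0.023, 0.0041, -0.33, -0.027
x_black = 15  # centered year for 1998, as stated in the text


def log_odds_ratio(h):
    """Log ratio of expected odds: blacks in year x_black vs. whites h years earlier."""
    x_white = x_black - h
    return b1 * h + b2 * (x_black**2 - x_white**2) + b3 + b4 * x_black


# Written out, the equation is quadratic in h:
#   -b2*h^2 + (b1 + 2*x_black*b2)*h + (b3 + b4*x_black) = 0
qa = -b2
qb = b1 + 2 * x_black * b2
qc = b3 + b4 * x_black
disc = math.sqrt(qb**2 - 4 * qa * qc)
roots = sorted([(-qb + disc) / (2 * qa), (-qb - disc) / (2 * qa)])
h = roots[0]  # smaller root: the fewest years one has to go back

print("h =", round(h, 2))
print("ratio of expected odds at h = 6:", round(math.exp(log_odds_ratio(6)), 2))
```

With these rounded inputs the smaller root lies close to 6, in line with the value reported above.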


CHAPTER  5 
EXAMPLES  OF  COUNT,  BINOMIAL  AND  BINARY  TIME  SERIES 

In  this  chapter  we  propose  GLMMs  with  autocorrelated  random  effects  for 
the  analysis  of  several  practical  examples.  We  will  apply  the  likelihood  estimation 
theory  developed  in  Chapter  3  and  use  the  model  theoretic  properties  derived  in 
Chapter  4.  In  Section  5.2  we  take  another  look  at  the  GSS  data  set  discussed  in 
Chapter  3  and  re-analyze  it  based  on  a  normal  approximation,  using  linear  mixed 
model  theory.  A  famous  time  series  of  counts  is  analyzed  in  Section  5.3,  with 
results  in  the  literature  compared  to  our  results  based  on  an  autoregressive  GLMM. 
Two  binary  time  series  are  analyzed  next  in  Section  5.4.  The  first  one  considers 
299  consecutive  eruptions,  which  are  classified  as  either  short  or  long,  from  the  Old 
Faithful  geyser  in  Yellowstone  National  Park.  The  second  one  considers  the  annual 
boat  race  between  teams  from  the  Universities  of  Cambridge  and  Oxford  and  is 
challenging because of several missing observations. Two goals in this example
are to establish the influence of the weight of the crew on the outcome of the race
(demystifying a long-held belief) and to predict a future outcome. First, though,
in  Section  5.1  we  present  ways  to  explore  and  picture  the  dependency  structure  in 
an  observed  discrete  time  series. 

5.1      Graphical  Exploration  of  Correlation  Structures 

Let  {yt}  be  a  realization  of  the  time  series  {Yt}.  In  practice,  we  have  to  choose 
an  appropriate  model  for  {yt}  based  on  information  the  data  provides  about  the 
dependency  structure.  An  important  tool  to  explore  the  dependency  structure  of 
the observed time series is the sample autocorrelation function (ACF). For equally
spaced data, it is defined as

$$\hat\rho(h) = \frac{\sum_{t=1}^{T-h} (y_{t+h} - \bar y)(y_t - \bar y)}{\sum_{t=1}^{T} (y_t - \bar y)^2}, \qquad (5.1)$$

where $h$ is the lag between observations and $\bar y$ is the sample mean. If the observed
time  series  displays  any  trend,  we  have  to  first  estimate  it  (maybe  by  fitting  a 
regular  GLM)  and  subsequently  explore  the  autocorrelations  among  the  residuals. 
A  comparison  of  the  autocorrelation  the  model  predicts  with  the  empirical  one 
observed  in  the  time  series  serves  as  a  crucial  check  on  the  adequacy  of  the  fitted 
model  and  its  assumptions. 
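
As a small illustration of the sample ACF for equally spaced data, the sketch below computes it directly from its definition (lagged cross-products of deviations from the sample mean, divided by the total sum of squares). The toy series is hypothetical and only serves to exercise the function.

```python
def sample_acf(y, h):
    """Sample autocorrelation at lag h for an equally spaced series y."""
    T = len(y)
    ybar = sum(y) / T
    # Numerator: lagged cross-products over the T - h available pairs.
    num = sum((y[t + h] - ybar) * (y[t] - ybar) for t in range(T - h))
    # Denominator: total sum of squares over all T observations.
    den = sum((v - ybar) ** 2 for v in y)
    return num / den


# Hypothetical toy series, for illustration only.
y = [1, 3, 2, 4, 3, 5, 4, 6]
print([round(sample_acf(y, h), 3) for h in range(4)])
```

For a series with a trend, the same function would be applied to the residuals from a preliminary fit rather than to the raw observations.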
5.1.1     The  Variogram 

For unequally spaced data, the variogram (Diggle, 1990) is a better measure
to describe the association than the ACF. In Diggle et al. (2002) the variogram is
discussed for longitudinal data, while we develop it here for the special case of time
series data $\{Y_t\}$. Define the variogram $\gamma(h)$ at lag $h$ as

$$\gamma(h) = \tfrac{1}{2} E[(Y_{t+h} - Y_t)^2].$$

If $\{Y_t\}$ is stationary, the variogram is directly related to the autocorrelation
function $\rho(h) = \text{corr}(Y_t, Y_{t+h})$ by

$$\gamma(h) = \tau^2 [1 - \rho(h)],$$

where $\tau^2$ is the variance of $Y_t$. (Even for a non-stationary time series, the
variogram is well defined provided the increments $Y_{t+h} - Y_t$ are stationary.) To develop
the empirical variogram, let $d_{tt^*}$ be the time between two observations $y_t$ and $y_{t^*}$.
The sample analog $g(h)$ of the variogram at lag $h$ is calculated by averaging over all
possible squared differences between observation pairs $h$ time units apart. That is,

$$g(h) = \frac{1}{2 |C_h|} \sum_{(t,t^*) \in C_h} (y_t - y_{t^*})^2,$$

where $C_h = \{(t,t^*) : d_{tt^*} = h\}$ is the set of all index pairs $(t,t^*)$ with corresponding
observations measured $h$ time units apart. A comparison of the sample variogram
$g(h)$ to an estimate $\hat\gamma(h)$ of the theoretical variogram implied by a particular model
serves as a check on the adequacy of the model. In Section 5.4 we show with the
help of the variogram the appropriateness of the modeled correlation structure for
the unequally spaced Oxford versus Cambridge boat race time series data.
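
A minimal sketch of the sample variogram for unequally spaced observations follows. It assumes the usual convention of half the average squared difference over all pairs that are exactly $h$ time units apart. The observation times and values are hypothetical.

```python
from collections import defaultdict


def sample_variogram(times, y):
    """Return {lag h: g(h)}: half the mean squared difference over pairs h apart."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for a in range(len(y)):
        for b in range(a + 1, len(y)):
            h = abs(times[b] - times[a])  # time between the two observations
            sums[h] += (y[b] - y[a]) ** 2
            counts[h] += 1
    return {h: sums[h] / (2 * counts[h]) for h in sorted(sums)}


# Hypothetical unequally spaced binary series (note the gaps in the times).
times = [1, 2, 4, 5, 8]
y = [1, 0, 0, 1, 1]
g = sample_variogram(times, y)
for lag, value in g.items():
    print(lag, round(value, 3))
```

Lags with few contributing pairs give noisy estimates, so in practice one would also want to report the pair counts alongside $g(h)$.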
5.1.2     The  Lorelogram 

For categorical, especially binary, responses, the dependency can also be
measured in terms of odds ratios. For a binary time series $\{Y_t\}$, Heagerty and
Zeger (1998) define the lorelogram $\theta(h)$ at lag $h$ as the log odds ratio between
observations $Y_t$ and $Y_{t+h}$,

$$\theta(h) = \log \frac{P(Y_t = 1, Y_{t+h} = 1) \, P(Y_t = 0, Y_{t+h} = 0)}{P(Y_t = 1, Y_{t+h} = 0) \, P(Y_t = 0, Y_{t+h} = 1)}. \qquad (5.2)$$

For an observed binary time series $y = (y_1, \ldots, y_T)$, the lorelogram can be
estimated by using sample proportions in place of the probabilities in (5.2). That is, the
sample lorelogram at lag $h$ is given by

$$\text{LOR}(h) = \log \frac{y_{[1,T-h]}' \, y_{[h+1,T]} \times (1_{T-h} - y_{[1,T-h]})' (1_{T-h} - y_{[h+1,T]})}{y_{[1,T-h]}' (1_{T-h} - y_{[h+1,T]}) \times (1_{T-h} - y_{[1,T-h]})' \, y_{[h+1,T]}},$$

where $y_{[a,b]}$ is the sub-vector $(y_a, \ldots, y_b)$ of $y$ and $1_{T-h}$ is a vector of $T - h$
ones. Proper adjustments have to be made for unequally spaced data. As with the
variogram, a comparison of the sample lorelogram to one implied by a particular
model serves as a check on the adequacy of the model.
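
The sample lorelogram for an equally spaced binary series reduces to a log odds ratio built from the four cell counts of the lag-$h$ pairs. The sketch below is illustrative only: the series is hypothetical, and raising an error for an empty cell is our choice, not a prescription from the text.

```python
import math


def sample_lorelogram(y, h):
    """Sample log odds ratio between y_t and y_{t+h} for a 0/1 series y."""
    first, second = y[:len(y) - h], y[h:]
    pairs = list(zip(first, second))
    n11 = sum(a * b for a, b in pairs)
    n00 = sum((1 - a) * (1 - b) for a, b in pairs)
    n10 = sum(a * (1 - b) for a, b in pairs)
    n01 = sum((1 - a) * b for a, b in pairs)
    if 0 in (n11, n00, n10, n01):
        raise ValueError("empty cell: log odds ratio undefined at this lag")
    return math.log((n11 * n00) / (n10 * n01))


# Hypothetical binary series, for illustration only.
y = [1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1]
print(round(sample_lorelogram(y, 1), 3))
```

The common factor $1/(T-h)$ in the sample proportions cancels in the ratio, which is why raw counts suffice.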

5.2     Normal Time Series

Section 4.1 discussed a linear mixed model approach to time series modeling
for normal data. In this section, we illustrate it with the example about attitudes
towards homosexual relationships (a cross-sectional time series, see Section 3.1),
where attitude was measured 16 times between 1974 and 1998 for whites and
blacks. The binomial counts in this study are large enough to warrant an analysis
based on a normal approximation, although the binomial sample sizes are about 8
to 11 times larger for whites than for blacks for almost all years.

Initially, we will assume that conditional on random time effects $\{u_t\}$, the log
odds $\theta_{it}$ of approval of homosexual relationships for race $i$ at year $t$ are independent
over the years and follow a normal distribution with mean $\mu_{it} + u_t$ and standard
deviation $\tau$. (We could have also modeled the binomial counts or the proportions
of approval directly, but prefer the log odds approach because of the structural
problem of the identity link with modeling proportion data.) However, the usual
assumption of a constant standard deviation $\tau$ throughout the two groups is
grossly inappropriate. Firstly, we have to account for the fact that sample sizes in
the white group are much larger than sample sizes in the black group. Secondly,
the estimated asymptotic standard deviation of the log odds, derived by the delta
method and based on the asymptotic normality of the sample proportions, is

$$\text{std}(\hat\theta_{it}) = [n_{it} \hat\pi_{it} (1 - \hat\pi_{it})]^{-1/2}, \qquad (5.3)$$

where $\hat\pi_{it}$ is the sample proportion and $n_{it}$ the sample size for race $i$ and year $t$.
Figure 5-1 shows these estimates for the two groups. To put the scale for the empirical
standard deviation for this plot into perspective, the estimated log odds range from
-1.8 to -0.9 for the white group and from -2.8 to -1.4 for the black group. Figure 5-1
shows that the variability in the log odds is markedly smaller for white respondents
than for black ones due to the overall larger sample sizes in the white group.
Furthermore, the variability in the log odds is not constant over time, especially in
the black group. This is due to the fact that in the years 1988 to 1991 and 1993 only
about half as many people were sampled as in the other years for both groups. Also,
the trend in the probabilities $\pi_{it}$ causes the standard deviations to be different
over time.

Figure 5-1: Empirical standard deviations $\text{std}(\hat\theta_{it})$ for the log odds of favoring
homosexual relationships by race.

In an effort to remedy all these effects simultaneously, we associate a weight

$$w_{it} = [n_{it} \hat\pi_{it} (1 - \hat\pi_{it})]^{1/2}$$

with each observation that is the inverse of the empirical standard deviation given
in (5.3). It might then be reasonable to assume that the weighted log odds $w_{it}\theta_{it}$
have constant conditional standard deviation $\text{std}(w_{it}\theta_{it} \mid u_t) = \tau$, or, stated
differently, that $\text{std}(\theta_{it} \mid u_t) = \tau / w_{it}$.
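
To make the weighting concrete, the sketch below evaluates the delta-method standard deviation of a sample log odds and the corresponding weight. The sample sizes and proportions are hypothetical; they are chosen only to mimic a large and a small group.

```python
import math


def log_odds_std(n, p_hat):
    """Delta-method standard deviation of the sample log odds, as in (5.3)."""
    return (n * p_hat * (1 - p_hat)) ** -0.5


def weight(n, p_hat):
    """Weight defined as the inverse of the standard deviation above."""
    return math.sqrt(n * p_hat * (1 - p_hat))


# Hypothetical sample sizes and sample proportions.
for label, n, p in [("larger sample", 1200, 0.20), ("smaller sample", 150, 0.10)]:
    print(label, round(log_odds_std(n, p), 3), round(weight(n, p), 2))
```

Multiplying a log odds by its weight puts all observations on a common scale, which is what motivates the constant conditional standard deviation for the weighted log odds.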

We are now able to specify an autoregressive model. Analogous to Section 3.1,
we assume the following linear mixed model for the log odds:

$$\theta_{it} = \alpha + \beta_1 x_{1t} + \beta_2 x_{1t}^2 + \beta_3 x_{2i} + \beta_4 x_{1t} x_{2i} + u_t + \epsilon_{it},$$

where as before $x_{1t}$ is the (centered) year the response was measured, $x_{2i}$ is an
indicator for race, $u_t$ is the year-specific random effect and the $\epsilon_{it}$ are assumed
independent $N(0, \tau^2 / w_{it}^2)$. With the motivation given in Section 3.1, we assume
autocorrelated random effects $u_{t+1} = \rho^{d_t} u_t + \epsilon_t$, with $d_t$ the time gap between
successive observation years, to model the serial dependence in successive
observations caused by gradual change in unobserved factors such as public
opinion. We used the procedure proc mixed in SAS to obtain the maximum
likelihood estimates for all parameters in this model, which are shown in Table
5-1. The mixed procedure allows the observation-specific weights to enter the
variance-covariance matrix of the conditional log odds given the random effects through
the weight statement. We selected the so-called spatial power covariance matrix
(SAS proc mixed with random statement option: type=sp(pow)(year)) for the
covariance matrix of the random effects, because it allows for unequally spaced
observations. The results confirm the ones we saw earlier using a logistic regression
approach with autoregressive random effects to model the serial log odds. Since
parameter estimates in both models are almost equal, substantially the same
conclusions as given in Section 4.3.3 based on the logit GLMM are reached.

Table 5-1: Comparing estimates from two models for the log-odds.

                  normal              logistic
Param.        est.      s.e.      est.      s.e.
$\alpha$     -1.80      0.07     -1.80      0.07
$\beta_1$     0.025     0.007     0.023     0.007
$\beta_2$     0.0039    0.0008    0.0041    0.0009
$\beta_3$    -0.32      0.057    -0.33      0.07
$\beta_4$    -0.027     0.007    -0.027     0.009
$\tau$        0.67      -         -
$\sigma$      0.11                0.10      0.03
$\rho$        0.66                0.65      0.25

Maximum likelihood estimates and asymptotic standard errors based on an
approximate normal linear mixed model for the log odds (first two columns) and the
logistic GLMM with autocorrelated random effects of Section 3.1 (last two columns).

One advantage of the normal approximating model is the ease with which marginal
statements can be obtained. For instance, formula (4.2) applies, but now with a
weight factor. That is,

is the marginal correlation between the log odds of approval for race $i$ for
observations in years $t$ and $t^*$. The marginal correlations depend on the weight factors and
therefore vary throughout the years. For instance, using the maximum likelihood
estimates of the variance components, the estimated correlation between the log
odds of approving homosexual relationships in years 1996 and 1998 is 0.52 for whites
and 0.48 for blacks.

5.3     Analysis of the Polio Count Data

Table 2 in Zeger (1988) lists a time series of 168 monthly counts, most of
them small, of new cases of poliomyelitis in the U.S. between 1970 and 1983,
plotted in the first panel of Figure 5-2. Of interest is whether the data provide
evidence of a long-term decrease in the rate of polio infections. Many authors
have analyzed this data set, beginning with Zeger (1988), who uses marginal model
fitting techniques (cf. Section 1.2). He estimates a yearly decrease of 6.3% in new
cases of poliomyelitis, although a 95% confidence interval ranges from a
decrease of 12.2% to an increase of 1.1%. We demonstrate our technique by
re-analyzing these data using a GLMM with autoregressive random effects (ARGLMM)
to incorporate the time dependence between adjacent counts. Conditional on a
random time effect $u_t$, let $Y_t$ be a Poisson variable representing the count in month $t$,
$t = 1, \ldots, 168$. Following Zeger (1988), we model the log of the conditional mean of
$Y_t$ as

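$$\log(E[Y_t \mid u_t]) = \log(\mu_t) = \alpha + \beta_1 (t - 73)/1000 + \beta_2 \cos(2\pi t/12) + \beta_3 \sin(2\pi t/12) + \beta_4 \cos(2\pi t/6) + \beta_5 \sin(2\pi t/6) + u_t, \qquad (5.4)$$

where the random effects follow the autoregressive process

$$u_{t+1} = \rho u_t + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma^2 (1 - \rho^2)), \quad u_1 \sim N(0, \sigma^2).$$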

The  sine  and  cosine  pairs  adjust  for  annual  and  semi-annual  seasonal  patterns  of 
the  counts  displayed  in  Figure  5-2. 
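
To get a feeling for the data-generating mechanism behind model (5.4), one can simulate from it. The sketch below is not part of the estimation procedure; it assumes a stationary AR(1) latent process whose innovation variance is $\sigma^2(1 - \rho^2)$, so that every $u_t$ has marginal variance $\sigma^2$, and it plugs in the Poisson ARGLMM estimates from Table 5-2. The function names and the simple Poisson sampler are ours.

```python
import math
import random


def poisson_draw(rng, lam):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_arglmm(alpha, beta, sigma, rho, T=168, seed=1):
    """Simulate monthly counts from the conditional Poisson model with AR(1) effects."""
    rng = random.Random(seed)
    u = rng.gauss(0.0, sigma)                   # u_1 ~ N(0, sigma^2)
    innov_sd = sigma * math.sqrt(1 - rho**2)    # assumed stationary innovations
    counts = []
    for t in range(1, T + 1):
        eta = (alpha + beta[0] * (t - 73) / 1000
               + beta[1] * math.cos(2 * math.pi * t / 12)
               + beta[2] * math.sin(2 * math.pi * t / 12)
               + beta[3] * math.cos(2 * math.pi * t / 6)
               + beta[4] * math.sin(2 * math.pi * t / 6)
               + u)
        counts.append(poisson_draw(rng, math.exp(eta)))
        u = rho * u + rng.gauss(0.0, innov_sd)  # move to u_{t+1}
    return counts


# Poisson ARGLMM estimates taken from Table 5-2.
counts = simulate_arglmm(-0.03, [-3.74, -0.10, -0.50, 0.20, -0.36], 0.70, 0.66)
print(counts[:12], "mean:", round(sum(counts) / len(counts), 2))
```

Repeating the simulation with different seeds gives a rough impression of how much clustering of high counts the latent autocorrelation can produce.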

Figure 5-3 shows the convergence of selected parameter estimates and their
estimated standard errors in an MCEM algorithm, for two different sets of starting
values for the variance components. Convergence parameters were set to $\epsilon_1 =
0.002$, $c = 5$, $\epsilon_2 = 0.01$, $\epsilon_3 = -0.001$, $a = 1.04$ and $q = 1.1$ (see Section 2.3.3).
All maximum likelihood estimates and their standard errors are presented in Table
5-2 together with estimates from other models. A negative binomial model takes
overdispersion into account, but treats the observations as independent. Similarly,
a Poisson GLMM with independent random effects $u_t$ i.i.d. $N(0, \sigma^2)$ only adjusts
for overdispersion in the counts, but does not address the dependency among
the observations. The Poisson ARGLMM has the largest maximized log-likelihood
among these models and uses only one parameter more than a negative binomial or

Figure 5-2: Plot of the Polio data.
The first panel shows the observed time series of counts of polio infections from 1970
to 1984. Not shown is the observed count of 14 in November 1972. The second panel
shows the observed counts as circles and superimposes the fitted conditional model
(5.4), where posterior means are used to estimate the random effects. The third panel
shows the fit of the marginal model implied by the conditional formulation.

Figure 5-3: Iteration history for the Polio data.
The plot shows the iteration history for parameters $\beta_1$, $\sigma$ and $\rho$ and their estimated
asymptotic standard errors for the Poisson ARGLMM. The iteration number is
plotted on the x-axis. The two different lines in each plot correspond to two
different sets of starting values for $\sigma$ and $\rho$. The starting value for $\beta_1$ was its GLM
estimate. Final Monte Carlo sample sizes in the MCEM algorithm were 29,070 and
34,005, respectively, for the two different sets of starting values.

Table 5-2: Parameter estimates for the polio data.

                         GLM       Neg. Bin.    Poisson-GLMM  Poisson-ARGLMM Chan & Ledolter
Param.          est.    s.e.    est.    s.e.    est.    s.e.    est.    s.e.    est.    s.e.
$\alpha$        0.21    0.08    0.21    0.10   -0.05    0.11   -0.03    0.15    0.21    0.13
$\beta_1$      -4.80    1.40   -4.33    1.85   -4.34    1.92   -3.74    2.91   -4.62    1.38
$\beta_2$      -0.15    0.10   -0.14    0.13   -0.13    0.13   -0.10    0.15    0.15    0.09
$\beta_3$      -0.53    0.11   -0.50    0.14   -0.51    0.14   -0.50    0.16   -0.50    0.12
$\beta_4$       0.17    0.10    0.17    0.13    0.17    0.13    0.20    0.13    0.44    0.10
$\beta_5$      -0.43    0.10   -0.42    0.13   -0.38    0.13   -0.36    0.13   -0.04    0.10
$1/k$                           0.57    0.16
$\sigma$                                        0.72    0.10    0.70    0.12    0.64
$\rho$                                                          0.66    0.20    0.89    0.04
LL           -132.49         -113.37         -112.40          -92.50           19.00

Fit of a Poisson GLM, a negative binomial GLM, a Poisson GLMM with
independent random effects and a Poisson ARGLMM. The last column holds parameter
estimates as reported by Chan and Ledolter (1995). Their estimates of $\sigma$ and $\rho$ are
transformed to bring them into agreement with our parametrization of the latent
autoregressive process.

Poisson GLMM with independent random effects. Any information criterion, such
as the AIC, would heavily favor the Poisson ARGLMM.
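
The comparison by information criterion can be retraced from the log-likelihoods in Table 5-2. The sketch below uses the AIC in its usual form, $2p - 2\,\text{LL}$, with parameter counts that we read off the model descriptions (six regression parameters, plus one or two further parameters). The log-likelihoods omit the same constant for every model, which does not affect the ranking.

```python
# (maximized log-likelihood from Table 5-2, assumed number of estimated parameters)
fits = {
    "Poisson GLM": (-132.49, 6),
    "Neg. Bin. GLM": (-113.37, 7),
    "Poisson GLMM": (-112.40, 7),
    "Poisson ARGLMM": (-92.50, 8),
}

# AIC = 2 * (number of parameters) - 2 * (log-likelihood); smaller is better.
aic = {name: 2 * p - 2 * ll for name, (ll, p) in fits.items()}
best = min(aic, key=aic.get)

for name, value in sorted(aic.items(), key=lambda item: item[1]):
    print(f"{name:15s} {value:7.2f}")
print("preferred model:", best)
```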

5.3.1      Comparison  of  ARGLMMs  to  other  Approaches 

Chan and Ledolter (1995) also used autoregressive random effects in a Poisson
GLMM setting to analyze these data. However, their implementation of the
MCEM algorithm seems to have been stopped prematurely. The path plot of
the coefficient of interest $\beta_1$ (Chan and Ledolter, 1995, p. 246) still shows some
trend movements when they declared convergence. Their convergence criterion is
based on the change in the marginal log-likelihood, while ours takes into account
the consecutive changes in parameter estimates and the Q-function. (Moreover,
their estimate of the maximized log-likelihood of 19 seems to be a rather
unusual value, considering our estimate of -92.5. This is excluding the constant
term $-\sum_t \log(y_t!)$ in the log-likelihood, which is equal to -140.5. We were not
able to reproduce their estimate with the parameter estimates given by Chan and
Ledolter (1995).) Also, our Monte Carlo sample size for the final iterations in the
MCEM algorithm is 28,500, about 14 times larger than theirs. The Monte Carlo
sample size increased exponentially in our implementation, but only two sample
sizes, 800 and then 2000 for confirming the results obtained with the 800 samples,
were used in Chan and Ledolter (1995). Based on their estimate of $\beta_1$ and its
asymptotic standard error, they conclude that the time trend is significant at the
5% level. However, our analysis shows an insignificant time trend at the 5% level,
which is in agreement with the conclusion of Zeger (1988) and other approaches
that use different ways of incorporating correlation into a Poisson model. For
instance, Fahrmeir and Tutz (2001) use a transitional model including the past
5 responses and report a p-value of 0.095 for the test of a zero slope for the time
trend. Li (1994) fitted a slightly different transitional model and also reported an
insignificant time trend after accounting for the autocorrelation. He noted that the
series might be too short to establish significance of a linear time trend. Benjamin
et al. (2003) used the negative binomial instead of the Poisson as the conditional
distribution in a transitional model, and reported a better fit with it, but again an
insignificant time trend.

Note that in general, regardless of its significance, the time trend parameter
in the transitional models fitted by the above authors has to be interpreted conditional
on past responses (or functions involving past responses). In our model, however,
$e^{\beta_1}$ can simply be interpreted as the linear time effect on the marginal mean of
the polio counts. Let $\mu_t^M = E[Y_t]$ denote the marginal mean of $Y_t$, as given by
expression (4.4). Then with our model (5.4), the ratio $\mu_{t+12}^M / \mu_t^M$ of two marginal
means exactly one year apart is equal to $e^{12 \beta_1 / 1000}$. Using the ML estimate of $\beta_1$,
we then estimate a yearly decline of 4.5% in the marginal polio counts. However, a
95% confidence interval for this change ranges from a yearly decline of 10.6% to
a yearly increase of 2.1%, including the possibility of no change.
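
The yearly change follows directly from the trend coefficient. The sketch below uses the rounded ARGLMM estimate and standard error from Table 5-2 and a simple Wald-type interval on the coefficient scale. Both are assumptions on our part, so the output only approximately matches the figures quoted above.

```python
import math

b1_hat, se = -3.74, 2.91  # rounded ARGLMM estimate and standard error, Table 5-2


def yearly_change(b1):
    """Relative change in the marginal mean over 12 months: exp(12*b1/1000) - 1."""
    return math.exp(12 * b1 / 1000) - 1


point = yearly_change(b1_hat)
lower = yearly_change(b1_hat - 1.96 * se)  # Wald interval, transformed endpoints
upper = yearly_change(b1_hat + 1.96 * se)

print(f"yearly change: {100 * point:.1f}% "
      f"(95% interval: {100 * lower:.1f}% to {100 * upper:.1f}%)")
```

With these rounded inputs the point estimate is a decline of a little over 4% per year, and the interval covers zero, which mirrors the conclusion in the text.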
At a given time point $t$, a negative binomial GLM (with the parametrization
used in Section 4.2) implies that the variance exceeds the mean $\mu_t^M$ by a factor
of $(1 + \mu_t^M / k)$. For the Poisson ARGLMM we saw that this overdispersion factor
equals $[1 + \mu_t^M (e^{\sigma^2} - 1)]$. Using the estimates from Table 5-2, $1/\hat k$ equals 0.57 and
$e^{\hat\sigma^2} - 1$ equals 0.63. Both models seem to propose a similar amount of overdispersion
relative to their estimated marginal means. Also, both models are similar in the
sense that they assume the same overdispersion parameter ($k$ in the negative
binomial case, $\sigma$ in the ARGLMM case) for all observations. However, the negative
binomial model does not adjust for correlation among the responses.
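
The two overdispersion factors are easy to compare numerically. The sketch below plugs the Table 5-2 estimates into both expressions. The marginal mean used for the comparison is an arbitrary illustrative value.

```python
import math

inv_k_hat = 0.57   # negative binomial estimate of 1/k, Table 5-2
sigma_hat = 0.70   # ARGLMM estimate of sigma, Table 5-2


def nb_factor(mu):
    """Variance-to-mean ratio under the negative binomial GLM."""
    return 1 + mu * inv_k_hat


def arglmm_factor(mu):
    """Variance-to-mean ratio under the Poisson ARGLMM."""
    return 1 + mu * (math.exp(sigma_hat**2) - 1)


print("exp(sigma^2) - 1 =", round(math.exp(sigma_hat**2) - 1, 2))
mu = 2.0  # hypothetical marginal mean, for illustration only
print(round(nb_factor(mu), 2), round(arglmm_factor(mu), 2))
```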
5.3.2     A  Residual  Analysis  for  the  ARGLMM 

We assess the quality of our model by a residual analysis. For the random
intercepts GLMM and the ARGLMM, we define the residual at time $t$ as $r_t = y_t - \mu_t^M$,
where $\mu_t^M$ is the marginal mean (4.4). Figure 5-4 shows the autocorrelation
function of the estimated residuals $\hat r_t = y_t - \hat\mu_t^M$ based on the fit of a negative
binomial GLM and a Poisson ARGLMM. The autocorrelation function of residuals
from the fit of a GLMM with independent random effects is omitted from the
plot since it is almost identical to the one from the negative binomial GLM.
While the significant (based on an asymptotic standard error of $\sqrt{1/T} = 0.08$)
residual autocorrelation at lag 1 shows the inappropriateness of the negative
binomial GLM and the Poisson GLMM with independent random effects, the
autocorrelation function of the residuals is as expected for the Poisson ARGLMM.
(We could also judge significance by a Monte Carlo experiment where we look
at the smallest and largest lag 1 correlation of 1000 reordered time series of the
residuals. If the observed lag 1 correlation of 0.25 falls close to or above the upper
bound, the correlation is deemed significant.) The model-implied autocorrelation
function $\rho_t(h) = \text{corr}(r_t, r_{t+h}) = \text{corr}(Y_t, Y_{t+h})$ for the Poisson ARGLMM is
given by (4.8) and depends on the time $t$. For a given lag $h$, we took the average
$\bar\rho(h) = \frac{1}{T-h} \sum_{t=1}^{T-h} \rho_t(h)$ of all possible lag $h$ model-based autocorrelations to
construct an estimate of the correlation at that lag. The line with filled triangles
in Figure 5-4 represents this average of model-based autocorrelations. It seems
reasonably close to the observed autocorrelation in the residuals from the fit of the
Poisson ARGLMM, especially at the most important first two lags. (A generalized
estimating equations approach with a marginally specified AR(1) correlation matrix
gives a very similar estimate (0.24) of the marginal lag 1 correlation. However,
it is questionable whether the marginal AR(1) correlation is justified since both the
transitional model and the ARGLMM indicate a longer dependence relationship.)
The estimated model-based autocorrelation function $\bar\rho$ indicates that marginal
correlations die out for observations 3 or more months apart.
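
The Monte Carlo check on the lag 1 correlation mentioned above can be sketched as a simple reordering experiment. The residual series below is a hypothetical, strongly autocorrelated stand-in (the actual residuals are not reproduced here), and the function names are ours.

```python
import math
import random


def lag1_corr(r):
    """Sample lag 1 autocorrelation of a series r."""
    n = len(r)
    mean = sum(r) / n
    num = sum((r[t + 1] - mean) * (r[t] - mean) for t in range(n - 1))
    den = sum((v - mean) ** 2 for v in r)
    return num / den


def permutation_bounds(residuals, n_perm=1000, seed=0):
    """Smallest and largest lag 1 correlation over randomly reordered series."""
    rng = random.Random(seed)
    shuffled = list(residuals)
    values = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reordering destroys any serial dependence
        values.append(lag1_corr(shuffled))
    return min(values), max(values)


# Hypothetical smooth (hence autocorrelated) residual series of length 168.
residuals = [math.sin(t / 5) for t in range(168)]
low, high = permutation_bounds(residuals)
observed = lag1_corr(residuals)
print(round(observed, 2), round(low, 2), round(high, 2))
```

An observed value near or beyond the upper bound of the reordered series would be read as evidence of serial correlation in the residuals.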

The residual analysis also reveals an extreme observation (a count of 14 new
cases in Nov. 1972) which was deemed insignificant by Zeger (1988), but was
addressed by Chan and Ledolter (1995), who added an additional parameter to
the model. Benjamin et al. (2003) calculated a conditional tail probability of 0.02
for observing an event that extreme or even more extreme under their negative
binomial transitional model. Based on our Poisson ARGLMM, we estimate a
conditional mean of 9.1 (cf. Figure 5-2) for an observation at that time point,
using the posterior mean of the random effect for Nov. 1972 as its prediction. This
translates to an estimated conditional tail probability of 0.08 for observing 14 or
more new cases of poliomyelitis for that month, which does not seem too extreme.
However, if we base calculations on the marginal probability of observing an event
like this, then the probability is equal to 0.007. The marginal mean count for Nov.
1972 is estimated to be 2.6 (cf. Figure 5-2), but the Poisson distribution cannot
be used to calculate tail probabilities since marginally the counts are not Poisson.
Hence, we used Monte Carlo integration to estimate the marginal distribution
$P(Y_{\text{Nov. 1972}} = k)$, $k = 1, \ldots, 13$, from the conditional Poisson model by sampling
from the fitted random effects distribution and averaging over the conditional
Poisson densities specified through (5.4) to get the above result.

Figure 5-4: Residual autocorrelations for the Polio data.
The plot shows autocorrelation functions of residuals from the fit of a negative
binomial model (circles) and a Poisson ARGLMM (triangles). Also shown is an
estimate of the model-based autocorrelation function when the assumed model is the
negative binomial GLM (filled circles) or the Poisson ARGLMM (filled triangles).
The straight dotted line represents the asymptotic standard error $\sqrt{1/T}$ for the
correlation.
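
Both tail probabilities can be approximated with a few lines of code. The sketch below takes the conditional mean (9.1), the marginal mean (2.6) and $\hat\sigma = 0.70$ from the text. For the marginal calculation it assumes a normal random effect with variance $\sigma^2$ and a conditional mean equal to the marginal mean times $e^{u - \sigma^2/2}$, which is our reading of the model and not necessarily the exact computation used here.

```python
import math
import random


def poisson_tail(lam, k):
    """P(Y >= k) for Y ~ Poisson(lam), via the complementary cdf."""
    term, cdf = math.exp(-lam), 0.0
    for j in range(k):
        cdf += term            # add P(Y = j)
        term *= lam / (j + 1)  # recursion to P(Y = j + 1)
    return 1.0 - cdf


def marginal_tail(marg_mean, sigma, k, draws=100_000, seed=0):
    """Monte Carlo average of conditional Poisson tails over u ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        u = rng.gauss(0.0, sigma)
        lam = marg_mean * math.exp(u - sigma**2 / 2)  # E[lam] equals marg_mean
        total += poisson_tail(lam, k)
    return total / draws


conditional = poisson_tail(9.1, 14)
marginal = marginal_tail(2.6, 0.70, 14)
print(round(conditional, 2), round(marginal, 4))
```

The conditional value rounds to the 0.08 quoted above, and the Monte Carlo estimate of the marginal tail should come out at a similar order of magnitude to the reported 0.007.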

If we decide to eliminate the extreme observation (and, in a more general
context, if we eliminate several more observations or some observations are missing),
then the methods for unequally spaced time series developed in Chapter 3 are
useful and directly applicable. Here, however, we add an additional parameter
corresponding to an indicator function $I(t = \text{Nov. 1972})$ for this observation to
adjust for the extreme observation (in regular GLMs, this would force a perfect
fit for the outlying observation). The final sample size in the MCEM algorithm
with the additional parameter increases to 56,900 and some parameter estimates
are affected by this adjustment. For instance, the slope estimate now equals
$\hat{\beta}_1 = -2.49$ (s.e. = 3.66) and the lag 1 correlation among the conditional log-means
increases to $\hat{\rho} = 0.86$ (s.e. = 0.09). Their standard deviation decreases to $\hat{\sigma} = 0.59$
(s.e. = 0.15). The estimate for the coefficient of the indicator function for the
outlier  equals  1.84  with  a  standard  error  of  0.49.  There  is  now  even  less  evidence 
of  a  decreasing  time  trend  in  the  observed  time  series.  Figure  5-5  is  the  same 
as  Figure  5-4,  but  now  based  on  the  model  with  the  adjustment  for  the  extreme 
observation.

Figure 5-5: Residual autocorrelations with outlier adjustment for the Polio data.
This plot is the same as the one in Figure 5-4, but is now based on models with an
extra parameter to adjust for the extreme observation of November 1972.

5.4     Binary and Binomial Time Series

This section analyzes two binary time series data sets, illustrating regression
parameter estimation, checking for model appropriateness and predicting future
observations. Various implied model properties concerning the marginal distribution
are also derived.

5.4.1      Old Faithful Geyser Data

MacDonald and Zucchini (1997) propose a two-state hidden Markov model (for
a brief summary of the hidden Markov model, see Section 1.3) for data concerning
the eruption times of the Old Faithful geyser. The series consists of 299 consecutive
observations between the 1st of August and the 15th of August, 1985. Most
observations  can  be  characterized  as  either  long  or  short,  with  very  low  variation 
within  the  long  and  short  group.  In  fact,  some  of  the  eruption  times  measured  at 
night  are  only  recorded  as  being  either  short  or  long.  MacDonald  and  Zucchini 
(1997),  following  Azzalini  and  Bowman  (1990),  transform  the  series  into  a  binary 
one,  with  cutoff  point  defined  at  an  eruption  length  of  3  minutes.  They  analyze  the 
series  assuming  a  discrete  two-point  mixture  for  the  probability  of  a  long  eruption, 
where the mixture depends on the state of an underlying two-state Markov chain.
Here we attempt an approach assuming an underlying normal, first-order
autoregressive process. Let $y_t$, $t = 1, \ldots, 299$, be the discretized observations,
with a value of 0 indicating eruption times less than 3 minutes (short eruptions)
and a value of 1 otherwise (long eruptions). The sample ACF $r(h)$ of the series is
pictured in Figure 5-6 with numerical values given in Table 5-3. It clearly shows
signs of negative autocorrelation in the series. Note that the autocorrelations,
with increasing lag, do not decay geometrically, so a first-order Markov model (a
transitional model) is inappropriate. Marginal correlations implied by a GLMM
with autoregressive random effects, as developed in Section 4.3.1, seem capable of
capturing such behavior. Let $\{u_t\}_{t=1}^{T}$ be autocorrelated random effects in a model
for the conditional probability $\pi_t(u_t)$ of a long eruption. In particular, the model
has the form

$$\operatorname{logit}(\pi_t(u_t)) = \alpha + u_t, \quad u_1 \sim N(0, \sigma^2), \quad u_t = \rho u_{t-1} + \epsilon_t, \quad \epsilon_t \overset{\text{iid}}{\sim} N(0, \sigma^2[1 - \rho^2]),$$

where $\alpha$ is a parameter for the unconditional (or expected) logit of a long eruption
and the series $\{u_t\}$ captures the serial dependency. Conditional on these random
effects we assume that successive eruption lengths are independent. The MCEM
algorithm  with  Gibbs  sampling  from  the  posterior  distribution  of  the  random 
effects as described in Sections 2.3 and 3.4.1 is used to obtain maximum likelihood
estimates for this model. Since $\sigma$ and $\rho$ are the standard deviation and the lag
1 correlation of the conditional logits, we use the sample standard deviation and
sample lag 1 correlation of the empirical logits $\log\left(\frac{y_t + 0.5}{1 - y_t + 0.5}\right)$ as starting values for
these parameters. The starting value for $\alpha$ is its ML estimate under a GLM
assuming independent observations. Alternatively, the GLIMMIX macro
in SAS, which maximizes a normal approximation of the marginal likelihood (see
Section 2.2) but allows autocorrelated random effects to be incorporated, can be used.
The two sets of starting values, and some more we tried, yielded similar parameter
estimates and standard errors. Using the convergence parameters $\epsilon_1 = 0.001$, $c = 5$,
$\epsilon_2 = 0.003$, $\epsilon_3 = -0.001$, $a = 1.02$ and $q = 1.1$ (see Section 2.3.3), we obtained
the following estimates: $\hat{\alpha} = 1.24$ (0.56), $\hat{\sigma} = 3.69$ (1.17), $\hat{\rho} = -0.89$ (0.037).
The algorithm converged after 130 iterations with a final Monte Carlo sample
size of $m = 56{,}000$. The values in parentheses are standard errors obtained from a Monte Carlo
approximation of the observed information matrix (Louis 1982) with a Monte Carlo
sample of 100,000 drawn from the fitted posterior random effects distribution.
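To make the model structure concrete, the following sketch simulates a binary series from a logistic model with a stationary AR(1) random effect, using the estimates reported above as inputs. It is only a forward simulation under those values, not the MCEM fitting procedure; the function names and the seed are illustrative choices.

```python
import math
import random

def simulate_ar_logistic(T, alpha, sigma, rho, seed=0):
    """Draw y_1..y_T from logit(pi_t) = alpha + u_t with a stationary AR(1)
    random effect: u_1 ~ N(0, sigma^2), u_t = rho * u_{t-1} + eps_t."""
    rng = random.Random(seed)
    innov_sd = sigma * math.sqrt(1.0 - rho * rho)  # keeps Var(u_t) = sigma^2
    u = rng.gauss(0.0, sigma)
    y = []
    for _ in range(T):
        p = 1.0 / (1.0 + math.exp(-(alpha + u)))   # conditional P(long eruption)
        y.append(1 if rng.random() < p else 0)
        u = rho * u + rng.gauss(0.0, innov_sd)
    return y

def sample_acf(y, h):
    """Sample autocorrelation of the series y at lag h."""
    n = len(y)
    mean = sum(y) / n
    c0 = sum((v - mean) ** 2 for v in y) / n
    ch = sum((y[t] - mean) * (y[t + h] - mean) for t in range(n - h)) / n
    return ch / c0

y_sim = simulate_ar_logistic(299, alpha=1.24, sigma=3.69, rho=-0.89)
r1 = sample_acf(y_sim, 1)  # negative for a strongly negative rho
```

A simulated series of this kind alternates between long and short eruptions far more often than an independent series would, mirroring the sign pattern of the sample ACF.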
5.4.1.1     Marginal Properties

Since the estimate of $\sigma$ is large, we use formula (4.17) to derive marginal
correlations. Let $\tilde{\alpha}$ and $\tilde{\sigma}$ denote the maximum likelihood estimates of $\alpha$ and
$\sigma$ divided by the scaling factor 1.6. The calculation of marginal correlations enables a
comparison with the sample autocorrelation function of the observed series and
gives valuable information about the fit of the model. From (4.17), the marginal
autocorrelations are estimated by

$$\hat{\rho}_h = \widehat{\operatorname{corr}}(y_t, y_{t+h}) = \frac{\Phi_2\left((\tilde{\alpha}, \tilde{\alpha})', \hat{Q}(t, t+h)\right) - (\hat{\pi}^m)^2}{\hat{\pi}^m (1 - \hat{\pi}^m)}, \quad h = 1, 2, \ldots,$$

where

$$\hat{\pi}^m = \Phi\left(\tilde{\alpha} \Big/ \sqrt{\tilde{\sigma}^2 + 1}\right)$$

is the time-invariant estimated marginal probability of a long eruption time (equal
to 0.62) and

$$\hat{Q}(t, t+h) = \begin{pmatrix} 1 + \tilde{\sigma}^2 & \tilde{\sigma}^2 \hat{\rho}^{\,h} \\ \tilde{\sigma}^2 \hat{\rho}^{\,h} & 1 + \tilde{\sigma}^2 \end{pmatrix}$$

is the estimate of covariance matrix (4.15) for a bivariate zero-mean normal random
variable with cdf $\Phi_2$.

Table 5-3: Autocorrelation functions for the Old Faithful geyser data.

h:                    1      2      3      4      5      6      7      8      9     10
r(h):             -0.54   0.48  -0.35   0.32  -0.26   0.21  -0.16   0.14  -0.17   0.16
$\hat{\rho}(h)$:  -0.48   0.46  -0.37   0.35  -0.29   0.27  -0.23   0.21  -0.18   0.17

Comparison of numerical values of the sample ACF $r(h)$ and estimated ACF $\hat{\rho}(h)$
based on a logistic-normal ARGLMM.
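The probit-based calculation can be checked numerically with the sketch below. Two assumptions go beyond the text: the off-diagonal element of the covariance matrix is taken to be the scaled variance times the lag-h power of the AR correlation, and the bivariate normal cdf is evaluated with the standard identity that its derivative with respect to the correlation is the bivariate density, integrated by Simpson's rule. That identity is a numerical device chosen here, not necessarily what was used for the dissertation.

```python
import math

def norm_cdf(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bvn_cdf_equal(a, r, n=2000):
    """P(Z1 <= a, Z2 <= a) for a standard bivariate normal with correlation r,
    via Phi2(a, a; r) = Phi(a)^2 + integral_0^r phi2(a, a; t) dt (Simpson's rule)."""
    def dens(t):
        return math.exp(-a * a / (1.0 + t)) / (2.0 * math.pi * math.sqrt(1.0 - t * t))
    step = r / n                      # n is even; step is negative when r < 0
    total = dens(0.0) + dens(r)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * dens(i * step)
    return norm_cdf(a) ** 2 + total * step / 3.0

def marginal_quantities(alpha, sigma, rho, h, c=1.6):
    """Return (pi_m, P(0,0), corr) at lag h under the probit approximation."""
    s2 = (sigma / c) ** 2
    z = (alpha / c) / math.sqrt(1.0 + s2)   # standardized threshold
    r = s2 * rho ** h / (1.0 + s2)          # correlation implied by Q(t, t+h)
    pi_m = norm_cdf(z)
    p11 = bvn_cdf_equal(z, r)
    p00 = bvn_cdf_equal(-z, r)
    return pi_m, p00, (p11 - pi_m ** 2) / (pi_m * (1.0 - pi_m))

pi_m, p00, rho1 = marginal_quantities(1.24, 3.69, -0.89, 1)
_, _, rho2 = marginal_quantities(1.24, 3.69, -0.89, 2)
```

With the reported estimates as inputs, `pi_m` comes out near 0.62 and `rho1`, `rho2` near the first two model-based entries of Table 5-3, which supports the reading that the scaling by 1.6 is a division.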

The plot in Figure 5-6, with numerical values given in Table 5-3, shows good
agreement between these model-based estimates of the autocorrelation function and the
sample autocorrelation function. Similarly good agreement between the model and
the data can be seen from Figure 5-7, which plots the empirical lorelogram against
its model counterpart. Note that the empirical lorelogram is not defined at lag 1,
since the sequence of two consecutive short eruptions was not observed. The plot
also shows the asymptotic standard error (ASE) for the log odds ratio at each lag.
It should be mentioned that similarly good results were also achieved with the hidden
Markov model approach by MacDonald and Zucchini (1997).

Figure 5-6: Autocorrelation functions for the Old Faithful geyser data.
Comparison of the sample ACF $r(h)$ (triangles) and estimated marginal model
based ACF $\hat{\rho}(h)$ (squares) based on a logistic-normal ARGLMM for the Old Faithful
geyser data. The straight dotted lines represent the asymptotic standard error
$\sqrt{1/T}$ for the correlation.

Figure 5-7: Lorelogram for the Old Faithful geyser data.
Comparison of the empirical lorelogram LOR(h) (triangles) and estimated lorelogram
$\hat{\rho}(h)$ (squares) based on a logistic-normal ARGLMM for the Old Faithful
geyser data.

The probit approach is not the only way to calculate marginal probabilities
and correlations, although it provides closed form approximations. Integrating the
conditional success probability over the marginal distribution of the random effect
$u_t$ gives the marginal success probability at time t. Similarly, integrating the joint
conditional distribution of two successes, two failures or one success and one failure
over the two dimensional distribution of the random effects vector $(u_t, u_{t'})$ gives
the corresponding marginal joint probabilities. For instance, as we already briefly
mentioned, a unique feature of this series is that every short eruption is followed
by a long one; in other words, the sequence $(y_t, y_{t+1}) = (0, 0)$ is not observed
for any t. This is not a structural zero, as there is no a priori reason why a short
eruption cannot be followed by another short one, although Azzalini and Bowman
(1990) mention a geophysical interpretation which makes this quite unlikely. We
estimate the marginal joint probability $P(Y_t = 0, Y_{t+1} = 0)$ of two consecutive short
eruptions as

$$\hat{P}(Y_t = 0, Y_{t+1} = 0) = \Phi_2\left((-\tilde{\alpha}, -\tilde{\alpha})', \hat{Q}(t, t+1)\right) = 0.031 \qquad (5.5)$$

based on the probit model approximation using the 2-dimensional multivariate
normal cdf, and as

$$\hat{P}(Y_t = 0, Y_{t+1} = 0) = \frac{1}{m} \sum_{j=1}^{m} \left(1 + \exp\{\hat{\alpha} + u_t^{(j)}\}\right)^{-1} \left(1 + \exp\{\hat{\alpha} + u_{t+1}^{(j)}\}\right)^{-1} = 0.036 \qquad (5.6)$$

based on a Monte Carlo sum approximation of the two-dimensional integral with
$m = 500{,}000$ samples from the estimated joint distribution of $(u_t, u_{t+1})$.
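The Monte Carlo sum in (5.6) can be sketched as below. The pair of random effects is drawn from a bivariate normal with common standard deviation and lag 1 correlation equal to the reported estimates; the smaller sample size and the seed are illustrative choices, so the result only approximates the value quoted in the text.

```python
import math
import random

def mc_prob_two_short(alpha, sigma, rho, m=200000, seed=42):
    """Monte Carlo estimate of P(Y_t = 0, Y_{t+1} = 0): average the product of
    the two conditional failure probabilities over draws of (u_t, u_{t+1})."""
    rng = random.Random(seed)
    cond_sd = sigma * math.sqrt(1.0 - rho * rho)   # sd of u_{t+1} given u_t
    total = 0.0
    for _ in range(m):
        u1 = rng.gauss(0.0, sigma)
        u2 = rho * u1 + rng.gauss(0.0, cond_sd)
        total += 1.0 / ((1.0 + math.exp(alpha + u1)) * (1.0 + math.exp(alpha + u2)))
    return total / m

p00_mc = mc_prob_two_short(alpha=1.24, sigma=3.69, rho=-0.89)
```

Multiplying such an estimate by the 298 possible pairs of consecutive eruptions gives the order of magnitude of the expected count of two short eruptions in a row.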

With a straightforward extension of the results presented in Section 4.3.2,
we can calculate joint probabilities with the probit model approach for longer
sequences of long and short eruptions. For example, the joint probability of
observing a long eruption at time t, followed by a short eruption and another long
one is given by

$$\begin{aligned}
P(Y_t = 1, Y_{t+1} = 0, Y_{t+2} = 1) &= P(T_t < \tilde{\alpha} + u_t,\; T_{t+1} > \tilde{\alpha} + u_{t+1},\; T_{t+2} < \tilde{\alpha} + u_{t+2}) \\
&= P(W_t < \tilde{\alpha},\; -W_{t+1} < -\tilde{\alpha},\; W_{t+2} < \tilde{\alpha}) \\
&= \Phi_3\left((\tilde{\alpha}, -\tilde{\alpha}, \tilde{\alpha})', Q(t, t+1, t+2)\right),
\end{aligned}$$

again using the threshold interpretation $Y_t = 1 \Leftrightarrow T_t < \tilde{\alpha} + u_t$. Here, $\tilde{\alpha}$ is
the intercept term of a probit model, $W_s = T_s - u_s$, $s = t, t+1, t+2$, and $\Phi_3$
is the cdf corresponding to the multivariate normal, mean zero random vector
$(W_t, -W_{t+1}, W_{t+2})'$ with variance-covariance matrix

$$Q(t, t+1, t+2) = \begin{pmatrix} 1 + \tilde{\sigma}^2 & -\tilde{\sigma}^2 \rho & \tilde{\sigma}^2 \rho^2 \\ -\tilde{\sigma}^2 \rho & 1 + \tilde{\sigma}^2 & -\tilde{\sigma}^2 \rho \\ \tilde{\sigma}^2 \rho^2 & -\tilde{\sigma}^2 \rho & 1 + \tilde{\sigma}^2 \end{pmatrix}.$$

For a model based estimate of the probability of this particular sequence of three
consecutive observations we simply plug in maximum likelihood estimates of the
parameters appearing in $\Phi_3(\cdot)$ and $Q(t, t+1, t+2)$. Estimates of resulting counts
for all possible combinations up to order three are displayed in Table 5-4, using
both the probit-logit connection and Monte Carlo integration. The results are
very similar using both types of approximation, which speaks for the quality of the
closed form probit based approximations derived in the previous chapter.
5.4.1.2     Exchangeability  of  certain  sequences 

The probit connection also helps in explaining certain symmetries in the model
when no time varying covariates are present. In the geyser example, the conditional
probabilities only depend on an intercept term. Returning to sequences of length
two, the probability of the event $(W_t, -W_{t+1})' < (\tilde{\alpha}, -\tilde{\alpha})'$ is the probability of a
long eruption at time t followed by a short one at time t + 1. The symmetry of the
(mean-zero) bivariate normal distribution with the special form of the variance-covariance
matrix (most notably equal variances) demands that this event has the
same probability as the event $(-W_t, W_{t+1})' < (-\tilde{\alpha}, \tilde{\alpha})'$. The probability of the
latter event is associated with the probability of a short eruption followed by a long
one. Hence, the estimated marginal probabilities are the same for a sequence of a
long eruption followed by a short or a short eruption followed by a long. This is
reflected in Table 5-4, which compares observed and expected counts for several
more sequences of long and short eruptions.

For three consecutive eruptions, symmetry in the trivariate (mean-zero) normal
distribution of $(W_t, W_{t+1}, W_{t+2})'$ demands that the events

$$(W_t, W_{t+1}, -W_{t+2})' < (\tilde{\alpha}, \tilde{\alpha}, -\tilde{\alpha})' \quad \text{and} \quad (-W_t, W_{t+1}, W_{t+2})' < (-\tilde{\alpha}, \tilde{\alpha}, \tilde{\alpha})'$$

and the events

$$(-W_t, -W_{t+1}, W_{t+2})' < (-\tilde{\alpha}, -\tilde{\alpha}, \tilde{\alpha})' \quad \text{and} \quad (W_t, -W_{t+1}, -W_{t+2})' < (\tilde{\alpha}, -\tilde{\alpha}, -\tilde{\alpha})'$$

have equal probability. Hence, the model based marginal probabilities of two long
eruptions followed by a short one (1,1,0) and a short one followed by two long ones
(0,1,1) are equal. Accordingly, the marginal probability of two short eruptions followed by a long
one (0,0,1) is the same as the probability of one long eruption followed by
two short ones (1,0,0). Again, this symmetry is reflected in the expected counts
presented in Table 5-4. It can be interpreted as an exchangeability property for
certain sequences of long and short eruptions when no time-varying covariates are
present. E.g., denoting the event of two consecutive long eruptions with A and one
short eruption with B, our model suggests that the probability of AB
is the same as that of BA.
5.4.1.3     Technical  derivation  of  the  exchangeability  property 

It is not immediately obvious why these pairs of events have equal probability
under our model. The following is a proof of the fact that the two events
$(W_t, W_{t+1}, -W_{t+2})' < (\tilde{\alpha}, \tilde{\alpha}, -\tilde{\alpha})'$ and $(-W_t, W_{t+1}, W_{t+2})' < (-\tilde{\alpha}, \tilde{\alpha}, \tilde{\alpha})'$ have equal
probability. Without loss of generality, assume $t = 1$ and let $W = (W_1, W_2, W_3)'$
have a trivariate normal distribution with mean zero and variance-covariance
matrix $\Sigma$, where $\Sigma$ has diagonal elements all equal to $\sigma^2$ and equal covariances
$\sigma_{12} = \sigma_{23}$. Then the density function of $W$ is proportional to

$$\begin{aligned}
f(w_1, w_2, w_3) \propto \exp\Big\{ -\frac{1}{2|\Sigma|} \Big[ & (w_1^2 + w_3^2)(\sigma^4 - \sigma_{12}^2) + w_2^2(\sigma^4 - \sigma_{13}^2) \\
& + (w_1 + w_3)\left[2 w_2 \sigma_{12}(\sigma_{13} - \sigma^2)\right] + 2 w_1 w_3 (\sigma_{12}^2 - \sigma^2 \sigma_{13}) \Big] \Big\}.
\end{aligned}$$

From this expression it is straightforward to derive the corresponding expressions
for the densities of $(W_1, W_2, -W_3)'$ and $(-W_3, W_2, W_1)'$. Then, it can be shown
algebraically with a simple transformation argument that these two random vectors
have identical densities, i.e., $f(w_1, w_2, -w_3) = f(-w_3, w_2, w_1)$. (Notice the way in
which $w_1$ and $w_3$ enter the density above.) Now, the first event $(W_1, W_2, -W_3)' <
(\tilde{\alpha}, \tilde{\alpha}, -\tilde{\alpha})'$ has probability

$$\begin{aligned}
& \int_{-\infty}^{-\tilde{\alpha}} \int_{-\infty}^{\tilde{\alpha}} \int_{-\infty}^{\tilde{\alpha}} f(w_1, w_2, -w_3) \, dw_1 \, dw_2 \, dw_3 \\
&= \int_{-\infty}^{-\tilde{\alpha}} \int_{-\infty}^{\tilde{\alpha}} \int_{-\infty}^{\tilde{\alpha}} f(-w_3, w_2, w_1) \, dw_1 \, dw_2 \, dw_3 \\
&= \int_{-\infty}^{\tilde{\alpha}} \int_{-\infty}^{\tilde{\alpha}} \int_{-\infty}^{-\tilde{\alpha}} f(-w_3, w_2, w_1) \, dw_3 \, dw_2 \, dw_1 \\
&= \int_{-\infty}^{\tilde{\alpha}} \int_{-\infty}^{\tilde{\alpha}} \int_{-\infty}^{-\tilde{\alpha}} f(-v_1, v_2, v_3) \, dv_1 \, dv_2 \, dv_3,
\end{aligned}$$

where we used Fubini's theorem and in the last step simply renamed the variables
(i.e., the transformation $w_1 = v_3$, $w_2 = v_2$, $w_3 = v_1$). However, this last probability
is the probability of the event that the random vector $(-W_1, W_2, W_3)'$ is less than
$(-\tilde{\alpha}, \tilde{\alpha}, \tilde{\alpha})'$, quod erat demonstrandum. The proof for the equivalence of the other
pair of three consecutive eruptions is similar. Also, the case of the equivalence
of the marginal probabilities of a short eruption followed by a long, and a long
followed by a short, is handled with similar arguments using the bivariate normal
distribution.

Symmetry also occurs with the Monte Carlo approach for approximating
marginal probabilities, since the distributions of $(u_t, u_{t+1})$ and $(u_{t+1}, u_t)$ are
identical. Therefore, a random sample $(u_t^{(j)}, u_{t+1}^{(j)})$, $j = 1, \ldots, m$, from the joint
distribution of $(u_t, u_{t+1})$ is also a random sample from the distribution of $(u_{t+1}, u_t)$.
Consequently, $u_t$ and $u_{t+1}$ can be used interchangeably and the Monte Carlo
approximation to $P(Y_t = 1, Y_{t+1} = 0)$,

$$\frac{1}{m} \sum_{j=1}^{m} \frac{\exp\{\hat{\alpha} + u_t^{(j)}\}}{1 + \exp\{\hat{\alpha} + u_t^{(j)}\}} \cdot \frac{1}{1 + \exp\{\hat{\alpha} + u_{t+1}^{(j)}\}}$$

with samples $(u_t^{(j)}, u_{t+1}^{(j)})$ is equivalent to

$$\frac{1}{m} \sum_{j=1}^{m} \frac{\exp\{\hat{\alpha} + u_{t+1}^{(j)}\}}{1 + \exp\{\hat{\alpha} + u_{t+1}^{(j)}\}} \cdot \frac{1}{1 + \exp\{\hat{\alpha} + u_t^{(j)}\}},$$

which is the approximation to $P(Y_t = 0, Y_{t+1} = 1)$. The last column in Table
5-4 displays the expected numbers based on a Monte Carlo approximation of the
marginal probabilities.
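The symmetry of the two Monte Carlo sums can be illustrated with a short sketch. Both sums are evaluated on the same draws; they are equal in expectation, so for a finite sample they agree only up to Monte Carlo error. The sample size and seed are illustrative assumptions, and the reported estimates are used as inputs.

```python
import math
import random

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def mc_transition_probs(alpha, sigma, rho, m=200000, seed=7):
    """Monte Carlo estimates of P(Y_t=1, Y_{t+1}=0) and P(Y_t=0, Y_{t+1}=1)
    computed from the same draws of (u_t, u_{t+1})."""
    rng = random.Random(seed)
    cond_sd = sigma * math.sqrt(1.0 - rho * rho)
    p10 = p01 = 0.0
    for _ in range(m):
        u1 = rng.gauss(0.0, sigma)
        u2 = rho * u1 + rng.gauss(0.0, cond_sd)
        q1, q2 = expit(alpha + u1), expit(alpha + u2)
        p10 += q1 * (1.0 - q2)   # long followed by short
        p01 += (1.0 - q1) * q2   # short followed by long
    return p10 / m, p01 / m

p10, p01 = mc_transition_probs(alpha=1.24, sigma=3.69, rho=-0.89)
```

The two estimates differ only in the third decimal or beyond, which is consistent with the identical expected counts listed for the transitions "from 1 to 0" and "from 0 to 1" in Table 5-4.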

Another word of caution: Simply multiplying the estimated marginal probability
of a sequence by the sample size of 299 consecutive eruptions to obtain
the expected count is wrong and can lead to estimated counts larger than what
is possible in a sequence of 299 observations. We calculate expected counts of a
particular sequence by multiplying the estimated marginal probability for that
sequence by the number of possible consecutive sequences of length two or
three. E.g., there are 297 possible sequences of three consecutive eruptions for
the Old Faithful data set. Hence, with an estimated marginal probability of
$\hat{P}(Y_t = 1, Y_{t+1} = 1, Y_{t+2} = 1) = 0.1703$ for three consecutive long eruptions, we
expect a count of $297 \times 0.1703 = 50.6$ such sequences in the time frame observed for
that series. This distinction gets more important for shorter or unequally spaced
data, as will be demonstrated with the next example.
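The multiplier used for the expected counts can be computed by counting runs of consecutive time points. The helper below is an illustrative sketch (its name is not from the text). It reproduces the 297 triples for the geyser series and, using the list of years without a boat race given in Section 5.4.2 together with the 1877 dead heat, the 137 pairs and 130 triples quoted in Section 5.4.2.2.

```python
def count_windows(times, length):
    """Number of runs of `length` consecutive time points (unit steps) in `times`."""
    present = set(times)
    return sum(all(t + k in present for k in range(length)) for t in present)

# Old Faithful: 299 equally spaced observations.
geyser_times = range(1, 300)
expected_111 = count_windows(geyser_times, 3) * 0.1703  # 297 * 0.1703

# Boat race: years 1829-2003 without the no-race years and the 1877 dead heat
# (the December 1849 race is treated as the race of 1850, as in the text).
no_race = (set(range(1830, 1836)) | {1837, 1838, 1843, 1844, 1847, 1848,
           1851, 1853, 1855} | set(range(1915, 1920)) | set(range(1940, 1946)))
race_years = [y for y in range(1829, 2004) if y not in no_race and y != 1877]
pairs = count_windows(race_years, 2)
triples = count_windows(race_years, 3)
```

Using the number of windows rather than the series length keeps the expected counts of all sequences of a given length summing to the number of sequences that could actually have been observed.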

Note that the logistic ARGLMM uses only 3 parameters. The only cells
showing some lack of fit in Table 5-4 are the ones which involve two or more
consecutive short eruptions, an outcome that was not observed in the given time
span. However, our model assigns a very small probability to this event.

Table 5-4: Comparison of observed and expected counts for the Old Faithful geyser
data.

                                           expected counts
                      observed counts    probit    Monte Carlo

long eruptions (1)          194           185.7       185.0
short eruptions (0)         105           113.3       114.0
                            299           299         299

from 1 to 1                  89            81.4        81.9
from 1 to 0                 105           103.6       102.6
from 0 to 1                 104           103.6       102.6
from 0 to 0                   0             9.3        10.9
                            299           299         299

from 1 to 1 to 1             54            50.6        50.6
from 1 to 1 to 0             35            30.6        31.0
from 1 to 0 to 1            104            96.1        94.1
from 1 to 0 to 0              0             7.2         8.1
from 0 to 1 to 1             35            30.6        31.0
from 0 to 1 to 0             69            72.7        71.5
from 0 to 0 to 1              0             7.2         8.1
from 0 to 0 to 0              0             2.0         2.6
                            299           299         299

The table compares observed counts of short and long eruptions and various
transitions with those expected under a logistic ARGLMM for the Old Faithful geyser
data.

5.4.2     Oxford  versus  Cambridge  Boat  Race  Data 

In  this  illustration  we  consider  the  outcome  of  the  annual  boat  race  between 
teams  representing  the  University  of  Oxford  and  the  University  of  Cambridge. 
The  first  race  took  place  in  1829  and  was  won  by  Oxford,  and  the  last  race  (at  the 
time  of  writing)  took  place  in  2003  and  was  won  by  Oxford  for  the  second  time 
in  a  row.  Overall,  Cambridge  holds  a  slight  edge  by  winning  77  out  of  148  races 
(52.0%).  Two  races  were  held  in  1849,  one  in  March  and  one  in  December.  Since 
all other races are traditionally held in late March or early April, we treat the
result of December 1849 as the result for 1850, when no race took place. There
are 26 years, such as during both world wars, when the race did not take place.
These are 1830-1835, 1837/38, 1843/44, 1847/48, 1851, 1853, 1855, 1915-1919 and
1940-1945. No special handling of these missing data is required with our methods
of maximum likelihood estimation. In 1877, the race ended as a dead heat, which
we treat as another missing value in the sense that for this year no winner could
be determined. The data are available online at www.theboatrace.org and are
plotted in Figure 5-8.

Figure 5-8: Plot of the Oxford vs. Cambridge boat race data.
Squares at 0 and 1 are the outcomes of the individual races, where a square at 1
stands for a Cambridge win. The jagged line connects the estimates of the conditional
success probabilities $\pi_t(u_t)$ over time.

5.4.2.1     A GLMM with autocorrelated random effects

Let $y_t = 1$ if Cambridge wins in year t and $y_t = 0$ if Oxford wins, where t
indexes the 148 years in which the race took place. Conditional on an autoregressive random
time effect $u_t$, we model the log odds of a Cambridge win at time t as

$$\operatorname{logit}(\pi_t(u_t)) = \alpha + \beta w_t + u_t, \qquad (5.7)$$

where $w_t$ is the difference between the average weight of the Cambridge crew
and the Oxford crew. (The boats have standard size and weight.) This is the
only covariate available. A winner in one year is also likely to be a winner in the
next year because of overlapping crew memberships, rowing techniques, training
methods, experience and many other factors. In part, each outcome reflects the
underlying combined efforts leading up to the race. We propose correlated random
effects to parsimoniously characterize the variation and correlation in outcomes due
to these efforts. That is, our model establishes a link between successive winning
probabilities by specifying an underlying autoregressive process $u_{t+1} = \rho^{d_t} u_t + \epsilon_t$ for
the random effects, where $d_t$ is the time lag (in years) between two successive races.
Figure 5-9 shows the dependency in the data by plotting a smooth estimate $s(h)$ of
the sample variogram $g(h)$ for lags up to 50 years. The most important feature
is the strong increase for the first few lags, after which the sample variogram levels
off at a constant level.
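The autoregressive process with unequal time lags can be sketched as follows. The innovation variance used here, the marginal variance times one minus the squared lag-specific coefficient, is an assumption chosen so that the process stays stationary; the exact specification is the one developed in Chapter 3. The parameter values are those of Table 5-5 and the years are the first eight race years.

```python
import math
import random

def simulate_gapped_ar1(times, sigma, rho, seed=0):
    """Simulate u at the given (possibly unequally spaced) times with
    u_next = rho**d * u_prev + eps, where d is the gap between the two times.
    Assumed innovation variance: sigma^2 * (1 - rho**(2d)), so Var(u) = sigma^2."""
    rng = random.Random(seed)
    u = [rng.gauss(0.0, sigma)]
    for prev, cur in zip(times, times[1:]):
        phi = rho ** (cur - prev)                 # correlation across the gap
        innov_sd = sigma * math.sqrt(1.0 - phi * phi)
        u.append(phi * u[-1] + rng.gauss(0.0, innov_sd))
    return u

first_race_years = [1829, 1836, 1839, 1840, 1841, 1842, 1845, 1846]
u_path = simulate_gapped_ar1(first_race_years, sigma=2.65, rho=0.68)
```

Under this construction the correlation between random effects d years apart is the d-th power of the lag 1 correlation, regardless of how many races took place in between.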

Figure 5-9: Variogram for the Oxford vs. Cambridge boat race data.
Comparison of a smooth estimate of the sample variogram with a model based
estimate of the variogram. Triangles represent the smooth (natural cubic spline)
estimate $s(h)$ of the sample variogram. Squares represent the model based estimate
$\hat{\gamma}(h)$ of the variogram. Crosses are the actual values of the sample variogram $g(h)$.

Figure 5-10: Lorelogram for the Oxford vs. Cambridge boat race data.
Comparison of a smooth estimate of the sample lorelogram with the model based
estimate of the lorelogram. Triangles represent the smooth (natural cubic spline)
estimate $s(h)$ of the sample lorelogram. Squares represent the model based estimate
$\hat{\theta}(h)$ of the lorelogram. Crosses are the actual values LOR(h) of the sample
lorelogram and the grey dotted lines represent ± two times the ASE of the log odds
ratio.

A similar impression of the dependency structure in this data set is obtained
by using the lorelogram, shown in Figure 5-10. For each lag h, the log odds ratio of a
Cambridge win is estimated by cross-classifying outcomes h years apart. E.g.,
the first three values LOR(1) = 1.30, LOR(2) = 1.37 and LOR(3) = 0.74 in
the plot are the log odds ratios corresponding to the following contingency tables,
cross-classifying outcomes one, two, or three years apart:

lag 1      0     1
    0     42    24
    1     23    48

lag 2      0     1
    0     43    21
    1     24    46

lag 3      0     1
    0     37    25
    1     29    41

In constructing these tables, proper care must be taken to accommodate the years
where no race took place. Similar to the variogram, we observe a sharp decline in
the log odds ratio for the first few lags, after which the log odds ratio levels off at
around 0. Figure 5-10 also shows twice the asymptotic standard error (ASE) of the
log odds ratio at each lag, calculated from the observed tables. For the three tables
above, the ASEs are given by 0.36, 0.37 and 0.35, respectively.
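The log odds ratios and their standard errors can be recomputed from the three tables. The sketch assumes the usual large-sample formula for the ASE of a log odds ratio, the square root of the sum of reciprocal cell counts, which the text does not state explicitly but which reproduces the quoted values.

```python
import math

def log_odds_ratio(table):
    """Return (log odds ratio, asymptotic standard error) for a 2x2 table
    given as ((n00, n01), (n10, n11))."""
    (a, b), (c, d) = table
    lor = math.log(a * d / (b * c))
    ase = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    return lor, ase

tables = {
    1: ((42, 24), (23, 48)),
    2: ((43, 21), (24, 46)),
    3: ((37, 25), (29, 41)),
}
results = {lag: log_odds_ratio(tab) for lag, tab in tables.items()}
```

The lag 1 table contains 137 pairs in total, the number of pairs of consecutive race years available in the series.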

According to the web-page www.theboatrace.org, "Boat race legend has it
that the heavier and taller crews have an advantage when it comes to race day."
In fact, an estimate of the weight effect $\beta$ under the assumption of independent
outcomes from one year to another is equal to 0.056, with s.e. equal to 0.023. This
would support the claim that the heavier crew has higher odds of winning the
race. E.g., we estimate that a 5 pound difference increases the odds of winning
by 32%. However, can this claim still be supported in the presence of dependent
observations, as both the plot of the variogram and the lorelogram suggest?

Table 5-5: Maximum likelihood estimates for boat race data.

                α        β        σ        ρ
estimate:     0.27    0.079     2.65     0.68
s.e.:         0.54    0.047     1.18     0.11

Figure 5-11: Path plots of fixed and random effects parameter estimates for the
boat race data.
The x-axis shows the iteration number of the MCEM algorithm.
The Monte Carlo sample size increased in 112 iterations from 100 at the
beginning to 8260 random draws from the posterior random effects distribution at
the end.

We used the MCEM algorithm described in Sections 2.3 and 3.3 for model
(5.7) to obtain the maximum likelihood estimates displayed in Table 5-5. Standard
errors are based on a Monte Carlo approximation of the observed information
matrix using 50,000 samples from the estimated posterior distribution. Trace plots
for parameter estimates are pictured in Figure 5-11, with convergence criteria
similar to the ones mentioned for the Old Faithful data: $\epsilon_1 = 0.001$, $c = 4$,
$\epsilon_2 = 0.003$, $\epsilon_3 = -0.005$, $a = 1.03$ and $q = 1.05$ (cf. Section 2.3.3). Regular GLM
estimates for $\alpha$ and $\beta$, together with $\sigma = 2$ and $\rho = 0$, were used as starting values for the
MCEM algorithm.

Based on Table 5-5, we estimate the conditional odds of a Cambridge win
to increase by 48% for every 5 pounds the average Cambridge team member (and
there are 9 on a boat including the cox) weighs more, but the estimated standard
error might be too large to conclude a significant effect. A quadratic effect of the
weight difference on the log odds was found to be insignificant. The significant
correlation of 0.68 for races one year apart indicates a strong dependency between
successive outcomes. This reflects the fact that the odds of winning are influenced
by factors such as overlapping crew memberships, training methods, motivation
and experience from previous races. The conditional estimate for $\beta$ translates
to an estimate of 0.041 for the marginal effect, using the probit-logit connection
mentioned in Section 4.3.2. Hence, the marginal odds of winning are estimated to
increase by 23% (compared to 32% from the GLM fit) for a 5 pound difference in
the average weight, when we properly adjust for correlation in the series. Moreover,
the large standard error of $\hat{\beta}$ does not rule out the possibility of no effect of weight
on the odds of winning.
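The percentages quoted above follow from exponentiating five times the relevant coefficient. The sketch also shows one way to obtain the marginal effect from the conditional one, assuming the attenuation factor implied by the probit-logit connection with scaling constant 1.6; the exact formula is the one in Section 4.3.2, and the version here is an assumption that reproduces the reported 0.041.

```python
import math

def pct_odds_change(beta, delta=5.0):
    """Percentage change in the odds for a covariate difference of `delta`."""
    return 100.0 * (math.exp(beta * delta) - 1.0)

def marginal_effect(beta_cond, sigma, c=1.6):
    """Assumed attenuation of a conditional logit coefficient:
    beta_cond / sqrt(1 + (sigma / c)^2)."""
    return beta_cond / math.sqrt(1.0 + (sigma / c) ** 2)

glm_pct = pct_odds_change(0.056)              # independence (GLM) fit
conditional_pct = pct_odds_change(0.079)      # conditional effect, Table 5-5
beta_marginal = marginal_effect(0.079, 2.65)  # marginal effect
marginal_pct = pct_odds_change(0.041)         # based on the reported 0.041
```

The attenuation grows with the random effects standard deviation, which explains why the marginal effect is roughly half the conditional one here.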

5.4.2.2     Checking the fit of the model

Given the maximum likelihood estimates, the model-based estimate of the
variogram is

$$\hat{\gamma}(h) = \tfrac{1}{2} \hat{E}\left[(Y_t - Y_{t+h})^2\right] = \hat{P}(Y_t = 1) - \hat{P}(Y_t = 1, Y_{t+h} = 1), \quad h = 1, 2, \ldots,$$

where the marginal and marginal joint probability can be estimated via the probit
connection or by integrating over the one- and two-dimensional random effects
distribution. With the inclusion of a time varying covariate (difference in average
crew weight $w_t$), marginal probabilities vary over time. We assume no weight
differences (i.e., $w_t = 0$ for all t) for the calculation in this section. Then, the
estimated marginal probability of a Cambridge win, $\hat{P}(Y_t = 1)$, or of two Cambridge
wins at times t and t + h, $\hat{P}(Y_t = 1, Y_{t+h} = 1)$, do not depend on the years t
and t + h. Figure 5-9 shows the model-based estimate of the variogram. The
agreement with the empirical variogram is good, especially for the most important
first few lags, and the model seems to capture the association displayed in the data
appropriately.

The  smooth  line  in  Figure  5  10  represents  the  model  based  estimates  of  the 

marginal  log  odds  ratio 

...  ^        (P{Yt  =  l,Yt^H  =  1)  X  PjYt  =  O.Yt^H  =  0) 
^  >       °^  \P{Yt  =  l,Yt+H  =  0)  X  P{Yt  =  0,Yt^H  =  1) 

for observations $h$ years apart, when both crews are of equal weight. For instance,
for races one year apart the model-based estimate of the marginal log odds ratio of
a Cambridge win is 1.38, approximated via the logit-probit connection. That is, the
odds of a Cambridge win over an Oxford win are estimated to be exp(1.38), or about 4,
times higher if Cambridge had won the previous race than if they had lost it.
Naturally, this factor gets smaller the greater the time separation between two
races. For instance, at lag 2, the odds of a Cambridge win are only 2.4 times higher
if they had won the race two years ago rather than losing it. Based on Figure 5-10, a
result five or more years in the past hardly has any influence on the result in year
$t$. That is, the odds of a Cambridge win in year $t$ are roughly the same whether
Cambridge had won or lost the race in year $t - 5$.
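
As a quick numerical check of the lag-1 figures quoted above, the log odds ratio can
be recomputed from the probit-based expected counts for pairs of consecutive years
listed in Table 5-6. This is only an illustrative sketch; the function name
`log_odds_ratio` is ours and not part of the original analysis.

```python
import math

def log_odds_ratio(n11, n10, n01, n00):
    """Log odds ratio from the four joint probabilities (or counts) at one lag."""
    return math.log((n11 * n00) / (n10 * n01))

# Probit-based expected counts for two consecutive years, taken from Table 5-6.
# Counts can be used in place of probabilities because the common multiplier
# (137 pairs) cancels in the ratio.
expected = {"WW": 50.5, "WL": 22.8, "LW": 22.8, "LL": 41.0}

lor_lag1 = log_odds_ratio(expected["WW"], expected["WL"],
                          expected["LW"], expected["LL"])
odds_ratio_lag1 = math.exp(lor_lag1)

print(f"lag-1 log odds ratio: {lor_lag1:.2f}")
print(f"lag-1 odds ratio:     {odds_ratio_lag1:.1f}")
```

Up to rounding of the tabulated counts, this reproduces the log odds ratio of 1.38
and the factor of about 4 discussed in the text.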

Table 5-6 compares observed and expected counts of particular sequences of wins and
losses, again assuming no weight difference between the two crews. Care must be
taken in finding all possible sequences of a given length in the observed time
series because of the unequal sampling intervals. For instance, with the specific
pattern of years without a race in the series of 148 unequally spaced observations,
only 137 sequences of two consecutive years and 130 sequences of three consecutive
years can be formed. These are the multipliers for the estimated probabilities of
two and three consecutive outcomes, respectively. In general, the agreement between
observed and predicted counts of sequences is excellent. The only minor discrepancy
concerns three Cambridge losses (or, equivalently, three Oxford wins) in a row.
However, in constructing this table we assumed no weight difference between the two
crews. On average, over the 148 races, the weight difference between the Cambridge
crew and the Oxford crew is -1.03 pounds, i.e., Oxford crews are heavier on average.
Out of the 33 races in which Oxford had a weight advantage (i.e., was heavier) of
more than 5 pounds, it won 21 races (64%). Similarly, out of the 25 races in which
Cambridge had a weight advantage of more than 5 pounds, it won 16 races (64%). Since
weight seems to have some effect on the outcome of the race and was marginally
significant, we underestimate the probability of three Cambridge losses (equivalently,
three Oxford wins) by assuming no weight difference for all three races. This leads
to an expected count slightly smaller than the one observed in the last entry of
Table 5-6. Factoring an average weight difference of -1.03 pounds for all three races
into the calculation of the marginal probability, the estimated expected count for
three Cambridge losses is 27.9, which is a little closer to the observed number of 32.
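
The bookkeeping behind the multipliers (137 and 130) can be sketched as follows. The
code counts only those windows in which a result exists for every calendar year, so
years without a race break a sequence. The data in `toy_results` are made up for
illustration and are not the boat race series; the function names are ours.

```python
def count_sequences(outcomes_by_year, length):
    """Count W/L patterns over `length` consecutive calendar years.

    A window is used only if an outcome is available for every year in it.
    """
    counts = {}
    for start in sorted(outcomes_by_year):
        years = range(start, start + length)
        if all(year in outcomes_by_year for year in years):
            pattern = "".join(outcomes_by_year[year] for year in years)
            counts[pattern] = counts.get(pattern, 0) + 1
    return counts

def expected_counts(pattern_probs, n_windows):
    """Expected count = (number of usable windows) x (estimated pattern probability)."""
    return {pattern: n_windows * p for pattern, p in pattern_probs.items()}

# Hypothetical series with a gap in 1903 (illustrative values only).
toy_results = {1900: "W", 1901: "W", 1902: "L", 1904: "L", 1905: "W"}

pairs = count_sequences(toy_results, 2)    # windows: 1900-01, 1901-02, 1904-05
triples = count_sequences(toy_results, 3)  # only 1900-02 is complete
n_pair_windows = sum(pairs.values())
print(pairs, triples, n_pair_windows)
```

In the toy series, five observations yield only three usable pairs and one usable
triple, which is the same effect that reduces 148 races to 137 and 130 windows.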

5.4.2.3  Prediction of random effects

In traditional GLMMs, the prediction of a univariate random effect $u$ describing
an exchangeable correlation structure is the posterior mean $E[u \mid y]$ of the
distribution of the random effect given the observed data. Similarly, the prediction
of the random process $u = (u_1, \ldots, u_T)'$ is the posterior mean $E[u \mid y]$.
Draws from the last iteration of the MCEM algorithm can be used to approximate the
posterior mean by a Monte Carlo average. Let $\hat{u}_t$ denote the $t$-th component
of this approximation. Then an estimate of the conditional probability of a
Cambridge win in year $t$ is given by

\[
\hat{\pi}_t(\hat{u}_t) =
  \frac{\exp\{\hat{\alpha} + \hat{\beta} w_t + \hat{u}_t\}}
       {1 + \exp\{\hat{\alpha} + \hat{\beta} w_t + \hat{u}_t\}}
\]




Table 5-6: Observed and expected counts of sequences of wins (W) and losses (L)
for the Cambridge University team.

                                  expected counts
sequence     observed counts     probit    Monte Carlo

-W-                 77            79.1        78.4
-L-                 71            68.9        69.6
total              148           148         148

-W-W-               48            50.5        49.8
-W-L-               23            22.8        23.4
-L-W-               24            22.8        23.4
-L-L-               42            41.0        40.4
total              137           137         137

-W-W-W-             34            34.8        34.1
-W-W-L-             13            13.1        13.2
-W-L-W-             12             9.4         9.8
-W-L-L-             10            12.2        12.2
-L-W-W-             11            13.1        13.3
-L-W-L-              9             8.6         8.9
-L-L-W-              9            12.2        12.3
-L-L-L-             32            26.7        26.1
total              130           130         130


Table 5-7: Estimated random effects $\hat{u}_t$ for the last 45 years for the boat
race data.

year  result  $\hat{u}_t$     year  result  $\hat{u}_t$     year  result  $\hat{u}_t$

1959    L       -0.49         1974    L       -0.23         1989    L       -2.35
1960    L       -0.58         1975    W       -0.12         1990    L       -2.11
1961    W        0.63         1976    L       -1.31         1991    L       -1.82
1962    W        0.71         1977    L       -2.07         1992    L       -1.03
1963    L       -0.19         1978    L       -2.57         1993    W        0.75
1964    W        0.12         1979    L       -2.76         1994    W        1.63
1965    L       -1.11         1980    L       -2.85         1995    W        2.05
1966    L       -1.34         1981    L       -2.83         1996    W        2.16
1967    L       -0.71         1982    L       -2.78         1997    W        2.08
1968    W        0.85         1983    L       -2.59         1998    W        1.74
1969    W        1.68         1984    L       -2.35         1999    W        1.07
1970    W        2.11         1985    L       -1.81         2000    L       -0.10
1971    W        2.11         1986    W       -0.56         2001    W       -0.08
1972    W        1.76         1987    L       -1.41         2002    L       -1.38
1973    W        1.05         1988    L       -1.98         2003    L       -1.82


and  plotted  in  Figure  5-8.  It  seems  that  a  sequence  of  wins  pulls  the  estimated 
conditional  probability  towards  1,  and  a  couple  of  losses  pulls  it  towards  0.  The 
predicted  random  effects  for  the  last  45  years  are  displayed  in  Table  5-7  together 
with  the  outcome  for  these  races.  The  structure  of  the  estimated  random  effects 
reflects  the  dynamics  of  the  data:  Positive  random  effects  are  usually  associated 
with  a  Cambridge  win  and  negative  ones  with  an  Oxford  win.  The  magnitude 
of  the  predicted  random  effects  increases  (decreases)  the  closer  the  start  (end) 
of  a  sequence  of  wins  or  losses  for  one  team  is  in  sight.  For  example,  in  1993 
the  predicted  random  effect  is  0.75.  During  the  next  3  years,  all  of  which  are 
Cambridge wins, the magnitude of the predicted random effects increases steadily,
reflecting the increased confidence (as measured by the odds) of another Cambridge
win due to past results. After 1996 the predicted random effects slowly decline, but
still show a preference for a Cambridge win. In 2000 they turn negative because of
an Oxford win in that year. For 2001 the random effect rises momentarily because
of another Cambridge win, but declines again in 2002 and 2003 because of two
consecutive Cambridge losses.

Part of this phenomenon can be explained through the form of the full
univariate conditional distribution of $u_t$ given the data and all other random
effects. Section 3.4.1 showed that this distribution depends on the immediate
predecessor $u_{t-1}$, the immediate successor $u_{t+1}$ and on the outcome $y_t$ of
the race. In turn, the full univariate conditional distribution of the successor
$u_{t+1}$ directly depends on the result $y_{t+1}$ of that race, and again on the
random effects before and after it. In this way, information on future wins or losses
is incorporated in the posterior distribution of $u_t$. For example, a lost race in
the near future results in decreasing predicted random effects preceding it,
reflecting this future change in momentum. Incidentally, the distribution of the
random effect $u_t$ at the boundary $t = T$ depends only on its immediate predecessor
$u_{T-1}$ and on $y_T$.

Using the prediction rule $\hat{y}_t = 1$ if $\hat{\pi}_t(\hat{u}_t) > 0.5$ and
$\hat{y}_t = 0$ otherwise, we are able to compare outcomes predicted by our model to
observed ones. Of the 148 observations, the GLMM with autocorrelated random effects
misclassifies only 8 (5.4%) as the opposite of the outcome that was actually
observed. This is, of course, a substantial improvement over predictions based on a
regular logit GLM, which has a misclassification rate of 62 out of the 148
observations (41.9%).
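
The prediction rule above is simple to state in code. The sketch below uses
made-up outcomes, covariates and predicted random effects, and hypothetical values
for the intercept `alpha` and slope `beta`; none of these numbers are estimates from
the boat race analysis.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def misclassification(y, w, u_hat, alpha, beta, cutoff=0.5):
    """Apply y_hat = 1 if pi_t(u_hat_t) > cutoff, else 0, and count the errors."""
    errors = 0
    for y_t, w_t, u_t in zip(y, w, u_hat):
        pi_t = expit(alpha + beta * w_t + u_t)
        y_pred = 1 if pi_t > cutoff else 0
        errors += int(y_pred != y_t)
    return errors, errors / len(y)

# Illustrative inputs only (1 = win, 0 = loss).
y_toy = [1, 1, 0, 0, 1]
w_toy = [0.0, 0.0, 0.0, 0.0, 0.0]
u_toy = [0.8, 1.2, -0.9, -1.5, -0.3]

n_errors, rate = misclassification(y_toy, w_toy, u_toy, alpha=0.1, beta=0.0)
print(n_errors, rate)
```

Because the predicted random effects are posterior means given the same outcomes
that are being classified, a rate computed this way describes in-sample agreement
rather than out-of-sample forecasting accuracy.
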
5.4.2.4     Prediction  of  future  outcomes 

The availability of marginal joint distributions through probit or Monte Carlo
approximation allows one to consider marginal conditional distributions, such as a
one-step-ahead forecast distribution

\[
P(Y_{T+1} = y_{T+1} \mid Y_T = y_T, Y_{T-1} = y_{T-1}, \ldots, Y_{T-s} = y_{T-s})
  = \frac{P(Y_{T+1} = y_{T+1}, Y_T = y_T, \ldots, Y_{T-s} = y_{T-s})}
         {P(Y_T = y_T, \ldots, Y_{T-s} = y_{T-s})}
\tag{5.8}
\]

for an outcome at time $T + 1$, factoring in the past $s + 1$ observations,
$s = 0, 1, 2, \ldots$. We use our proposed model to obtain an estimate of the
numerator and denominator in (5.8). The two-stage hierarchy together with the
autoregressive nature of the random effects imply

\[
P(Y_{T+1} = y_{T+1}, Y_T = y_T, \ldots, Y_{T-s} = y_{T-s})
  = \int \left( \prod_{t=T-s}^{T+1} P(Y_t = y_t \mid u_t) \right) g(u_{T-s})
         \left( \prod_{t=T-s+1}^{T+1} g(u_t \mid u_{t-1}) \right)
         du_{T-s} \cdots du_T \, du_{T+1}.
\tag{5.9}
\]

The last term $P(Y_{T+1} = y_{T+1} \mid u_{T+1})$ in the first product is given by
an extrapolation of the fitted model to time point $T + 1$, i.e., for $y_{T+1} = 1$,

\[
P(Y_{T+1} = 1 \mid u_{T+1}) = \mathrm{logit}^{-1}(x_{T+1}' \beta + u_{T+1}),
\]

where $x_{T+1}$ is the covariate vector for time point $T + 1$. It is estimated by
using the MLE $\hat{\beta}$ for $\beta$. The last term $g(u_{T+1} \mid u_T)$ in the
second product is determined by extrapolating the underlying random process
$\{u_t\}_{t=1}^{T}$ to time point $T + 1$, i.e.,
$u_{T+1} = \rho^{d_T} u_T + \epsilon_T$, where $d_T$ is the time distance between
points $T$ and $T + 1$ and $g(u_{T+1} \mid u_T)$ is a normal distribution with mean
$\rho^{d_T} u_T$ and variance $\sigma^2 (1 - \rho^{2 d_T})$.
For the boat race data, a Monte Carlo approximation to (5.9) is given by

\[
\frac{1}{m} \sum_{j=1}^{m} \prod_{t=T-s}^{T+1}
  \frac{\exp\{ y_t (\hat{\alpha} + \hat{\beta} w_t + u_t^{(j)}) \}}
       {1 + \exp\{ \hat{\alpha} + \hat{\beta} w_t + u_t^{(j)} \}},
\]

where $u_t^{(j)}$ is the $t$-th component from the $j$-th generated autoregressive
process $u^{(j)}$ (extrapolated to time point $T + 1$) with variance components
$\hat{\sigma}^2$ and $\hat{\rho}$.
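
The Monte Carlo approximation to (5.9), and the resulting forecast ratio (5.8), can
be sketched as follows. This is a minimal illustration using only the Python
standard library, not the implementation used for the results in this chapter. The
function names, the default number of draws `m`, and the parameter values used in
the example call are assumptions; only the structure (stationary AR(1) draws with
correlation `rho ** d` over a gap of `d` years, and the Bernoulli product) follows
the formulas above.

```python
import math
import random

def simulate_ar1_path(gaps, sigma, rho, rng):
    """One draw of a stationary Gaussian AR(1) process at points separated by `gaps`."""
    u = [rng.gauss(0.0, sigma)]
    for d in gaps:
        r = rho ** d  # correlation across a gap of d time units
        u.append(r * u[-1] + rng.gauss(0.0, sigma * math.sqrt(1.0 - r * r)))
    return u

def bernoulli_prob(y, eta):
    """exp{y * eta} / (1 + exp{eta}) for y in {0, 1}."""
    p = 1.0 / (1.0 + math.exp(-eta))
    return p if y == 1 else 1.0 - p

def forecast_win_prob(history, w, alpha, beta, sigma, rho,
                      gaps=None, m=20000, seed=1):
    """Monte Carlo estimate of P(Y_{T+1} = 1 | last s+1 outcomes), the ratio in (5.8).

    history: outcomes y_{T-s}, ..., y_T (oldest first)
    w:       covariate values for time points T-s, ..., T+1
    gaps:    time distances between successive points (defaults to 1 each)
    """
    rng = random.Random(seed)
    n = len(history)
    gaps = [1] * n if gaps is None else gaps
    numerator = denominator = 0.0
    for _ in range(m):
        u = simulate_ar1_path(gaps, sigma, rho, rng)
        past = 1.0
        for t in range(n):
            past *= bernoulli_prob(history[t], alpha + beta * w[t] + u[t])
        denominator += past  # contributes to P(Y_T = y_T, ..., Y_{T-s} = y_{T-s})
        numerator += past * bernoulli_prob(1, alpha + beta * w[n] + u[n])
    return numerator / denominator

# Hypothetical parameter values, for illustration only.
p_after_loss = forecast_win_prob([0], [0.0, 0.0], alpha=0.0, beta=0.0,
                                 sigma=2.0, rho=0.69)
p_after_win = forecast_win_prob([1], [0.0, 0.0], alpha=0.0, beta=0.0,
                                sigma=2.0, rho=0.69)
print(p_after_loss, p_after_win)
```

Reusing the same simulated paths for the numerator and the denominator keeps the
ratio stable; the common factor $1/m$ cancels.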

The estimated forecast probabilities of a Cambridge win in 2004 based on
the past $s + 1$ observations and the GLMM with autocorrelated random effects are
given in Table 5-8. For example, $P(Y_{149} = 1 \mid Y_{148} = 0)$, the conditional
probability of a Cambridge win in 2004 given the outcome of the race in 2003
(a Cambridge loss), is estimated to be 0.32. We assumed a zero weight difference,
$w_{T+1} = 0$, in calculating these forecast probabilities; however, any value can be
substituted. (Historically, the crews weigh in four days prior to the race, so that
$w_{T+1}$ will be available before the actual outcome of the race.) Including the
past two results of 2003 and 2002 (two Cambridge losses), the estimated probability
of a Cambridge win in 2004 decreases to 0.29. Considering the last three outcomes
(two Cambridge losses, one Cambridge win), the estimated probability is again 0.32.
Conditioning on outcomes even further back in time, the estimated probability of
a Cambridge win stays roughly constant at 0.31, even when the 7-year winning
streak of Cambridge in the years 1993 to 1999 is factored in. It seems reasonable,
however, that these outcomes do not affect the 2004 outcome, in terms of crew
membership and training methods, as much as the outcomes of 2003 or 2002 do. We
already mentioned in connection with the interpretation of the lorelogram and
Figure 5-10 that results five or more years in the past hardly seem to influence the
current outcome.

Table 5-8: Estimated probabilities of a Cambridge win in 2004, given the past s + 1
outcomes of the race.

s:          0       1       2       3       4        5         6
History:    L       LL      LLW     LLWL    LLWLW    LLWLWW    LLWLWWW
W:          0.323   0.286   0.318   0.309   0.311    0.312     0.313

The first row displays the number of years s preceding 2003 which are conditioned
on. The second row shows the history of Cambridge wins and losses from 2003 to
2003-s. The third row displays the estimated probabilities
$P(Y_{149} = 1 \mid Y_{148} = y_{148}, \ldots, Y_{148-s} = y_{148-s})$ of a Cambridge
win given the past s + 1 observations.

Considering the weight factor, if the average crewman on the Oxford boat
weighs 5 pounds more, then the estimated probability of a Cambridge win
decreases from 0.32 to 0.28 when conditioning on the outcome of the last race, and
from 0.29 to 0.24 when factoring in the last two outcomes. On the other hand, if
Cambridge holds a weight advantage of 5 pounds in 2004, their predicted winning
probabilities are 0.37 and 0.33, respectively.

One special case of the derivations above is to not condition on any of the past
outcomes and just look at the marginal probability of a Cambridge win in 2004,
which for general $T$ is given by

\[
P(Y_{T+1} = 1) = \int \pi_{T+1}(u_{T+1}) \, g(u_{T+1}) \, du_{T+1},
\]

where

\[
\mathrm{logit}(\pi_{T+1}(u_{T+1})) = \alpha + \beta w_{T+1} + u_{T+1},
\]

and $u_{T+1}$ has a marginal $N(0, \sigma^2)$ distribution. However, this estimator
does not factor in any past information. For no weight difference ($w_{T+1} = 0$) it
is calculated to be 0.52.
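
The one-dimensional integral above can be approximated numerically. The sketch
below uses a plain trapezoid rule on a grid spanning several standard deviations; it
is one possible approach, not necessarily the one used to obtain the value 0.52. The
function name, grid settings and parameter values in the example are assumptions.

```python
import math

def marginal_win_prob(alpha, beta, w_next, sigma, n_grid=2001, width=8.0):
    """Trapezoid approximation of the integral of expit(alpha + beta*w + u)
    against the N(0, sigma^2) density, over [-width*sigma, width*sigma]."""
    if sigma == 0.0:
        return 1.0 / (1.0 + math.exp(-(alpha + beta * w_next)))
    lower = -width * sigma
    step = 2.0 * width * sigma / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        u = lower + i * step
        density = math.exp(-0.5 * (u / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        pi_u = 1.0 / (1.0 + math.exp(-(alpha + beta * w_next + u)))
        weight = 0.5 if i in (0, n_grid - 1) else 1.0  # trapezoid end points
        total += weight * pi_u * density * step
    return total

# Hypothetical parameter values, for illustration only.
p_symmetric = marginal_win_prob(alpha=0.0, beta=0.0, w_next=0.0, sigma=2.0)
p_shifted = marginal_win_prob(alpha=1.0, beta=0.0, w_next=0.0, sigma=2.0)
print(p_symmetric, p_shifted)
```

With a zero linear predictor the marginal probability is 0.5 by symmetry. For a
positive linear predictor it lies between 0.5 and the conditional probability at
$u = 0$, which illustrates the attenuation of marginal relative to conditional
effects in logistic models with normal random effects.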

The above predictions are based on marginal probabilities, where random
effects have been integrated out. Another way of predicting the probability of a
future outcome uses the conditional model directly, incorporating an estimate of
the random effect at time $T + 1$. With the autoregressive nature of the random
effects, the minimum mean-squared error predictor of $u_{T+1}$ is given by
$E[u_{T+1} \mid u_T] = \rho u_T$. To estimate it, we use the posterior distribution
of the random effects given the observed data, i.e.,

\[
\hat{u}_{T+1} = E\big[ E[u_{T+1} \mid u_T] \mid y \big] = E[\rho u_T \mid y] = \rho \hat{u}_T,
\]

and plug in the maximum likelihood estimate of $\rho$. This is consistent with the
way random effects are predicted in spatial GLMMs as described by Zhang (2002). He
proved the following theorem under the assumption of known fixed and random
effects parameters. Here, we adapt his theorem to our time series context to
facilitate prediction of unobserved intermediate and future outcomes:

Theorem: Let $u_k$, $k \in Z_+$, be Gaussian with $E[u_k] = 0$ for all $k$. If
conditionally on $\{u_k, k \in Z_+\}$, $\{Y_k, k \in Z_+\}$ are independent and for
each $k$ the distribution of $Y_k$ depends on $u_k$ only, then for any $k$ and
observed time points $t_1, t_2, \ldots, t_T$,

\[
E[u_k \mid y] = \sum_{i=1}^{T} c_i \, E[u_{t_i} \mid y],
\tag{5.10}
\]

where the coefficients $c_i$ are such that
$E[u_k \mid u_{t_1}, \ldots, u_{t_T}] = \sum_{i=1}^{T} c_i u_{t_i}$ and
$y = (y_{t_1}, \ldots, y_{t_T})$ are the observations at time points
$t_1, \ldots, t_T$.

Proof: (An adaptation from Zhang, 2002.) Let $k \in Z_+$ and $k \neq t_i$ for
all $i = 1, \ldots, T$. Let $f(u_k, u, y)$ denote the joint density of $(u_k, u, y)$,
where $u = (u_{t_1}, \ldots, u_{t_T})$ holds the random effects at the observed time
points. The distribution of the observed data depends on the random effects at the
observation time points, but on no other random effects, hence
$f(y \mid u_k, u) = f(y \mid u)$. Then,

\begin{align*}
f(u_k, u, y) &= f(y \mid u_k, u) \, f(u_k, u) \\
             &= f(y \mid u) \, f(u_k, u) \\
             &= f(u, y) \, f(u_k \mid u).
\end{align*}

Dividing both sides by $f(u, y)$ we obtain $f(u_k \mid u, y) = f(u_k \mid u)$ and
consequently $E[u_k \mid u, y] = E[u_k \mid u] = \sum_{i=1}^{T} c_i u_{t_i}$ for some
appropriate constants $c_i$. By the properties of repeated expectation (i.e.,
$E_{X \mid y}[X \mid y] = E_{z \mid y}\big[ E_{X \mid z, y}[X \mid z, y] \mid y \big]$)
we have $E\big[ E[u_k \mid u, y] \mid y \big] = E[u_k \mid y]$ and (5.10) follows.

For the boat race data, we could use (5.10) to get a prediction of the
probability of an outcome in a year where no race took place, i.e., choose a
$k < t_T$, $k \neq t_1, \ldots, t_T$, where $t_1, \ldots, t_T$ are the years a race
took place. (For clarity, we now denote the set of years where a race took place as
$t_1, t_2, \ldots, t_T$, where $t_1$ is the year of the first race in 1829, $t_2$ is
the year of the second race in 1836 and $t_T$ is the year 2003.) More importantly,
we can use (5.10) to predict future outcomes. For instance, by setting
$k = t_T + 1$, i.e., to the year 2004, we obtain

\[
\hat{u}_{t_T + 1} = E[u_{t_T + 1} \mid y]
  = \sum_{i=1}^{148} c_i \, E[u_{t_i} \mid y]
  = \rho \, E[u_{t_T} \mid y] = \rho \, \hat{u}_{t_T}
\]

as the prediction for the random effect for that year, the same result we derived
before. Here, we made use of (5.10) and the autoregressive nature of the random
effects, which imply that $E[u_{t_T + 1} \mid u_{t_1}, \ldots, u_{t_T}] = \rho u_{t_T}$,
i.e., $c_1 = \ldots = c_{T-1} = 0$ and $c_T = \rho$. The prediction can be evaluated
by plugging in the MLE for $\rho$ and using the Monte Carlo sample from the last
iteration in the MCEM algorithm to approximate the posterior mean. The prediction
for a random effect $s$ years into the future (and hence the prediction of the
distribution for a future outcome $Y_{t_T + s}$ under the assumed model) can be
obtained similarly. E.g., with $k = t_T + s$,
$E[u_{t_T + s} \mid y] = \sum_{i=1}^{T} c_i E[u_{t_i} \mid y] = \rho^s E[u_{t_T} \mid y]$,
since $E[u_{t_T + s} \mid u_{t_1}, \ldots, u_{t_T}] = \rho^s u_{t_T}$,
i.e., $c_1 = \ldots = c_{T-1} = 0$ and $c_T = \rho^s$.
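
The coefficients in (5.10) are easy to compute for an AR(1) process. The sketch
below covers the future case stated above ($c_T = \rho^s$) and, as an addition that
is not spelled out in the text, the intermediate case of a year without a race lying
$a$ years after the previous race and $b$ years before the next one. For the
intermediate case we assume a stationary Gaussian AR(1) process with correlation
$\rho^{|\text{lag}|}$; by its Markov property only the two neighbouring random
effects receive nonzero coefficients, and these follow from standard Gaussian
conditioning. Function names are ours.

```python
def future_prediction(u_last_hat, rho, s=1):
    """Prediction rho**s * E[u_{t_T} | y] for a random effect s years ahead."""
    return rho ** s * u_last_hat

def intermediate_coefficients(rho, a, b):
    """Coefficients (c_prev, c_next) of the neighbouring random effects for a time
    point a units after the previous and b units before the next observed point."""
    denom = 1.0 - rho ** (2 * (a + b))
    c_prev = rho ** a * (1.0 - rho ** (2 * b)) / denom
    c_next = rho ** b * (1.0 - rho ** (2 * a)) / denom
    return c_prev, c_next

# One-step-ahead prediction with the values quoted in the text for the boat race
# data (estimated rho of 0.69 and posterior mean -1.82 for 2003).
u_2004 = future_prediction(-1.82, 0.69)
print(round(u_2004, 2))

# Intermediate year one year after the previous race and two years before the next
# (an illustrative gap, not a specific gap from the data).
c_prev, c_next = intermediate_coefficients(0.69, a=1, b=2)
print(c_prev, c_next)
```

The intermediate prediction would then be
`c_prev * E[u_prev | y] + c_next * E[u_next | y]`, with the posterior means again
approximated from the last MCEM iteration.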

For the boat race data, the estimated random effect for 2004 is
$\hat{u}_{149} = 0.69 \times (-1.82) = -1.26$. Then, according to our model, the
estimated probability of a Cambridge win in 2004, given this prediction for the
random effect in that year, is 0.26.

CHAPTER 6
SUMMARY, DISCUSSION AND FUTURE RESEARCH

In this dissertation we proposed autocorrelated and other correlated random
effects in GLMMs as a means of introducing and modeling correlation in regression
models for series of unequally spaced counts or binary/binomial observations. In
Chapter 1 we contrasted our regression approach with one based on modeling
the mean, variance and covariance directly (marginal models), and another one
based on regressing the current response on previous observations and covariates
(transitional models). At the cost of increased computational time and complex
algorithms, inferential procedures for GLMMs are based on the joint likelihood of
the $T$ observations $y_1, \ldots, y_T$. In contrast, marginal models are based on a
quasi-likelihood approach and estimation in transitional models relies on
conditionally or partially specified likelihoods that do not represent the full joint
distribution. In particular, with these alternatives it is impossible to construct
tables such as 5-4 and 5-6 that compare observed counts to marginal predicted counts
of sequences of events.

In Chapter 2 we presented a general MCEM algorithm for fitting GLMMs and
derived specific algorithmic details for equally correlated and autocorrelated random
effects in Chapter 3. There, we gave details on the implementation of a full
iterative M-step as opposed to just a single iteration. We also focused on the Gibbs
sampler as a means of sampling from the posterior distribution of the random effects
given the observed data, as required for the approximation of the E-step and
posterior predictions of the random effects. Other MCMC methods such as a
Metropolis-Hastings algorithm can also be employed; however, the Gibbs sampling
approach reduced to simple forms for autoregressive random effects and was
relatively fast in implementations. Furthermore, the structure of the full univariate
conditionals made clear how the correlations between random effects serve as building
blocks for the correlation between the time series observations.

The first graph in Figure 6-1 shows the exchangeable correlation structure
among the $Y_t$'s implied by the traditional GLMM assumption of one random effect
common to all observations. All vertices (where a vertex corresponds to a random
variable) which are not joined by an edge are conditionally independent. The
second graph shows the same picture for autoregressive random effects. Note that
a marginal dependency between $Y_t$ and $Y_{t-1}$ is induced through the path via
$u_t$ and $u_{t-1}$. (There is no edge between $u_{t-1}$ and $u_{t+1}$ if we assume a
lag 1 autoregressive process.) The third graph is slightly different in nature and
pictures the structure of the full univariate conditional distribution of $u_t$,
given the other random effects and the data $y$. As we showed in Section 3.4, the
conditional distribution of $u_t$ depends on its predecessor $u_{t-1}$, its successor
$u_{t+1}$ and on the current observation $y_t$. In turn, $u_{t-1}$ and $u_{t+1}$
depend on the observations at times $t - 1$ and $t + 1$, respectively. Thus, the
posterior of $u_t$ incorporates information on past and future responses and random
effects.

Although autocorrelated random effects in GLMMs are not new to the
literature (e.g., Chan and Ledolter, 1995), we extended their application by
explicitly allowing for gaps in the observed time series through specifying the
correlation in the AR(1) process in terms of a lag. This allowed us to handle
missing data in the series without any additional procedures and adjustments to
likelihood inference. In some instances, predicting responses at times where no
responses (and covariates) were observed could be a potential goal. We presented
some theory for intermediate prediction with the Oxford vs. Cambridge boat race
data. Because of the way random effects incorporate information on previous and
future observations, our regression models should be well suited to predict
intermediate observations not observed at certain time points.

Figure 6-1: Association graphs for GLMMs.
The first two diagrams represent the associations among observations
$Y_1, \ldots, Y_T$ in GLMMs with one common random effect and autocorrelated random
effects $\{u_t\}$. Vertices not connected by edges are conditionally independent. The
last graph represents the association structure for the posterior distribution of
$u_t$ given the other autocorrelated random effects and the data. Influence of
covariates is not shown.
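
To make the lag-based specification concrete, the following sketch builds the
correlation matrix of the random effects at unequally spaced time points, with
entry $\rho^{|t_i - t_j|}$ for a pair of points. It assumes the stationary AR(1)
form referred to above; the time points in the example are hypothetical and the
function name is ours.

```python
def ar1_correlation(times, rho):
    """Correlation matrix with entries rho ** |t_i - t_j| for the given time points."""
    return [[rho ** abs(ti - tj) for tj in times] for ti in times]

# Hypothetical observation times with a gap between the second and third point.
obs_times = [1, 2, 5, 6]
R = ar1_correlation(obs_times, rho=0.5)
for row in R:
    print(row)
```

A gap simply shows up as a higher power of $\rho$ (here $0.5^3$ between the second
and third observation), so no separate treatment of missing years is needed.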

Chapter 4 was devoted to deriving marginal properties of our time series
regression  models  based  on  normal,  Poisson,  negative  binomial  and  binomial 
distributional  assumptions  on  the  observations.  We  saw  that  conditionally  specified 
GLMMs  lead  to  marginal  overdispersion  relative  to  the  normal,  Poisson,  negative 
binomial  or  binomial  variance,  and  we  gave  formulas  for  their  expressions  in  the 
case  of  correlated  random  effects  suggested  here.  More  importantly,  we  derived 
expressions  for  the  marginal  correlations  between  any  two  members  of  the  time 
series  implied  by  our  models.  While  in  the  case  of  normal,  Poisson  and  negative 
binomial  time  series  regression  models  these  have  closed  forms,  approximations 
have  to  be  used  in  the  binomial  case  with  a  logit  link.  We  explored  several  options 
and  presented  an  approximation  based  on  the  similarity  between  logit  link  and 
probit-link  models  to  evaluate  marginal  properties.  The  derived  formulas  for 
marginal  means  and  correlations  are  useful  for  comparing  empirical  quantities  to 
model-based quantities, such as a comparison of observed and predicted counts,
observed and predicted autocorrelations or observed and predicted log odds ratios
in a time series. These are important aspects in determining the appropriateness of
the random effects distribution and the goodness of fit of the model in general.
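
As a reminder of what such closed forms look like, consider the Poisson case with a
log link, $\log E[Y_t \mid u_t] = x_t'\beta + u_t$, and a stationary Gaussian process
with $\mathrm{Var}(u_t) = \sigma^2$ and $\mathrm{Corr}(u_t, u_s) = \rho^{|t - s|}$.
The expressions below follow from the moment generating function of the normal
distribution; they are a sketch under these assumptions, and the notation may differ
from that of Chapter 4.

```latex
\begin{align*}
\mu_t = E[Y_t] &= \exp\{x_t'\beta + \sigma^2/2\}, \\
\mathrm{Var}(Y_t) &= \mu_t + \mu_t^2 \left( e^{\sigma^2} - 1 \right), \\
\mathrm{Cov}(Y_t, Y_s) &= \mu_t \, \mu_s \left( e^{\sigma^2 \rho^{|t - s|}} - 1 \right),
  \qquad t \neq s.
\end{align*}
```

The second line shows the overdispersion relative to the Poisson variance $\mu_t$,
and the third shows how the marginal correlation decays with the lag $|t - s|$.
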
Applications  of  the  MCEM  algorithm  developed  in  Chapters  2  and  3  and 
the  theory  developed  in  Chapter  4  are  given  in  Chapter  5.  Examples  of  binomial, 
binary  and  count  time  series  were  presented  and  modeled  within  the  proposed 
framework  of  GLMMs  with  autoregressive  random  effects  (ARGLMMs).  For  the 
binary  case,  we  explored  certain  symmetries  the  model  implies  in  the  marginal 
distribution when no time-varying covariates are observed. Also, some theory on
predicting future events in binary time series based on the conditional model or the
implied marginal model was developed. Some results of these data analyses will be
discussed in the next two sections.

6.1      Cross-Sectional  Time  Series 
We motivated the usefulness and appropriateness of our methodology through
the analysis of a cross-sectional binomial time series from one of the largest US
data sets for social science research. Scientists who make use of this database and
would like to analyze developments of count or binomial responses through
time should consider the methods described here because they

•  address  the  temporal  dependence  in  the  observations  over  the  years, 

•  address  the  cross  sectional  dependence  within  a  year,  and 

•  naturally  handle  gaps  in  the  observed  time  series. 
We showed how to analyze such data by assuming an approximate normal distribution
for the log odds and fitting a corresponding linear mixed effects model. We
made adjustments by appropriately weighting the log odds to more closely meet
the assumptions of a normal linear mixed model. However, we also presented the
analysis based on the true binomial nature of the observations, using a logistic
ARGLMM. We consider this a better approach, particularly if the binomial sample
sizes are small, because it allows the variance of the log odds to vary as a function
of the mean. In time series models, the mean often displays trend behavior, and
therefore the assumption of constant variance is inappropriate. How far weighting
the log odds by their estimated asymptotic standard deviation in the normal
approximation model alleviates all these problems simultaneously remains doubtful.
The ARGLMM can be fit using the MCEM algorithm outlined in Chapters 2 and 3 and
possesses the three features mentioned above. Furthermore, using the marginal
results for binomial time series discussed in Section 4.3, these models allow for
both overdispersion relative to the binomial variance and correlation between
successive observations.

Data from the General Social Survey are not the only application of our
methods. In political science, especially in international relations, annual binary
time series cross-sectional data are very common. For instance, a lot of research
focuses on the analysis of the relationship (conflict/no conflict) between two states
over a long period of years. Ad hoc methods such as including the residuals from
a preliminary logit analysis in the linear predictor (Oneal and Russett, 1997) have
been proposed to adjust for the temporal dependence. More sophisticated methods treat
the binary time series as grouped time-to-event (or survival) data, and include
temporal dummy variables in the linear predictor of a logit model. These dummy
variables mark the number of years in between two events (i.e., conflicts) of the
binary time series and are motivated by a relationship between Cox's proportional
hazard model for time-to-event data and logit models (Beck, Katz and Tucker,
1998). However, one drawback of these methods is that usually we observe several
events (e.g., conflicts) over time. In particular, Beck et al. (1998) treat the
probability of subsequent events as independent from the first one. This is a much
stricter (and often unrealistic) assumption than the conditional independence
assumption in ARGLMMs.

Furthermore, by including dummy variables (or, as also suggested, a natural
cubic spline version) in the linear predictor to induce dependency, the nature of this
dependency cannot be modeled. Beck et al. (1998) note that "temporal dependence
cannot provide a satisfactory explanation [of conflict] by itself, but must, instead,
be the consequence of some important, but unobserved, variable." Hence, the
GLMMs with the assumption of a latent autoregressive random process developed
in this dissertation seem to be a natural approach to analyzing binary time series
cross-sectional data and are an attractive alternative to the widely used methods
suggested by Beck et al. (1998).

6.2     Univariate  Time  Series 

Apart from cross-sectional time series data, we focused on the analysis of
a single time series of counts or binary observations in Chapter 5. Standard
loglinear or logit analysis ignoring the serial dependence may result in misleading
inference. This was evident for the polio data of Section 5.3, where we showed
that the ARGLMM adequately captured the correlation structure for the residuals,
which was not the case for other, more standard models. These standard models (an
ordinary Poisson GLM, a negative binomial GLM and a Poisson GLMM) showed
strong evidence of a time trend (see Table 5-2), whereas the evidence seems to be
considerably weaker when the correlation is accounted for. Similarly, for the boat
race data, we were able to correctly quantify a common belief about the influence
of weight. Ignoring the correlation in the series of wins and losses would have led to
an overstatement of the influence of weight, which appears weaker once the
analysis takes the serial correlation into account.

We  emphasized  model  checking  through  residual  analysis  with  a  comparison  of 
empirical  and  theoretical  autocorrelations,  lorelograms  and  variograms  in  the  case 
of  unequally  spaced  observations.  For  the  Polio  data,  we  calculated  residuals  for  all 
entertained  models  and  showed  that  only  the  autocorrelation  function  implied  by 
an ARGLMM mimics the one observed in the residuals. Residual autocorrelations
from other models ignoring the serial dependence showed nonconformity with the
model specifications. For the Old Faithful data set we observed good agreement
between the ARGLMM implied marginal autocorrelation function and the empirical
one. Similarly, the estimated and empirical variogram and lorelogram for the
boat race data showed good agreement, indicating a reasonable assumption on the
model-implied dependency structure. This was further justified by a comparison
of observed and (marginally) predicted counts of sequences of wins and losses,
which we approximated using either the connection to the exact marginal
expressions for probit models or Monte Carlo approximation.
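
The comparison of empirical and model-implied autocorrelations can be sketched in a few lines. The following is a minimal illustration and not the code used for the analyses in Chapter 5. It assumes a Poisson model with log link, an intercept only (no covariates), and a stationary latent AR(1) process parameterized by its marginal standard deviation `sigma` and lag-one correlation `rho`; the function names and parameter values are hypothetical. Under these assumptions the marginal mean is exp(beta0 + sigma^2/2) and the lag-h covariance is mean^2 (exp(sigma^2 rho^h) - 1).

```python
import numpy as np

def simulate_counts(T, beta0, sigma, rho, rng):
    """Simulate y_t | u_t ~ Poisson(exp(beta0 + u_t)) with stationary AR(1) u_t."""
    u = np.empty(T)
    u[0] = rng.normal(0.0, sigma)              # stationary start: Var(u_t) = sigma^2
    innov_sd = sigma * np.sqrt(1.0 - rho**2)   # innovation sd that keeps Var(u_t) fixed
    for t in range(1, T):
        u[t] = rho * u[t - 1] + rng.normal(0.0, innov_sd)
    return rng.poisson(np.exp(beta0 + u))

def implied_acf(beta0, sigma, rho, max_lag):
    """Marginal autocorrelations of y_t implied by the assumed model."""
    mu = np.exp(beta0 + sigma**2 / 2)                 # marginal mean
    var = mu + mu**2 * (np.exp(sigma**2) - 1)         # marginal variance
    lags = np.arange(1, max_lag + 1)
    cov = mu**2 * (np.exp(sigma**2 * rho**lags) - 1)  # lag-h covariance
    return cov / var

def empirical_acf(y, max_lag):
    """Sample autocorrelations of the centered series."""
    r = y - y.mean()
    denom = np.sum(r**2)
    return np.array([np.sum(r[h:] * r[:-h]) / denom for h in range(1, max_lag + 1)])

rng = np.random.default_rng(1)
y = simulate_counts(50_000, beta0=1.0, sigma=0.5, rho=0.8, rng=rng)
for h, (emp, imp) in enumerate(zip(empirical_acf(y, 5), implied_acf(1.0, 0.5, 0.8, 5)), 1):
    print(f"lag {h}: empirical {emp:.3f}, implied {imp:.3f}")
```

With covariates in the linear predictor, the same comparison would be made on suitably standardized residuals rather than on the raw counts.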

6.2.1  Clipping  of  Time  Series 

The application of our regression models to the analysis of binary time series
and the methodology developed here may be broader than initially realized. Let $Y_t$
be a binary time series obtained by clipping (Kedem, 1980) an underlying process
$Z_t$ such that

$$Y_t = I[Z_t \in C],$$

where $I$ is the indicator function, which is 1 if $Z_t$ is in the set $C$ and 0 if it is
in its complement. Estimation ($t \le T$) and prediction ($t > T$) of
$\pi(u_t) = P(Y_t = 1 \mid u_t)$ using an ARGLMM with covariates $x_t$ then entails
estimation and prediction of the event $\{Z_t \in C\}$ based on covariate information.
This might be useful in a variety of settings, e.g., when an investigator is forced
to dichotomize the observed data or is more comfortable doing so.
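
As a small illustration of clipping, the sketch below turns a real-valued series into a binary one. The helper name `clip_series`, the data values and the threshold defining the set C are all made up for this example.

```python
import numpy as np

def clip_series(z, in_set):
    """Return Y_t = I[Z_t in C]; membership in C is given by the predicate `in_set`."""
    return np.asarray(in_set(np.asarray(z))).astype(int)

z = np.array([0.3, 1.7, 2.4, 0.9, 3.1])   # made-up values of an underlying process Z_t
y = clip_series(z, lambda v: v > 1.0)     # C = (1, infinity), an arbitrary threshold
print(y)
```

The resulting 0/1 series `y` is the kind of response to which the binary models of this dissertation apply.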

6.2.2  Longitudinal  Data 

The type of time series data we consider in this dissertation is different from
longitudinal or panel data, which usually consist of only a few repeated observations
(i.e., T is small, often less than 5), but with a large number of replications. It
is doubtful whether the estimation techniques (such as GEE) developed for longitudinal,
especially interdependent binary data are useful in our context of very long time
series data, since the temporal dependence is much richer. For instance, while it
was  possible  to  fit  a  marginal  model  via  GEE  for  the  two  binomial  time  series  of 
16  observations  each  in  the  data  about  homosexual  relationships,  we  could  not  fit 
any  other  of  the  time  series  mentioned  in  Chapter  5  with  GEE  methodology.  The 
reason is that solving the estimating equations requires inversion of the $T \times T$
covariance matrix for $(y_1, \ldots, y_T)$. In longitudinal studies, this matrix is also large
but has block-diagonal structure with low-dimensional blocks corresponding to the
few repeated measurements within a cluster. For the GEE analysis of the polio
count data, Zeger (1988) proposed approximating the $T \times T$ covariance matrix
of $(y_1, \ldots, y_T)$ with a simpler band-diagonal matrix corresponding to an
autoregressive process, which then has an easy inverse. Also, note that in the case of a
single  time  series  (i.e.,  a  single  cluster),  the  usual  approach  of  adjusting  estimated 
standard  errors  by  using  the  sample  covariance  matrix  as  a  robust  estimate  of  the 
correlation  matrix  is  not  applicable.  This  is  because  the  usual  sum  over  clusters  in 
the  expression  of  the  robust  asymptotic  covariance  matrix  reduces  to  a  single  sum- 
mand,  which  is  the  score  equation  that  was  set  equal  to  0.  Furthermore,  the  GEE 
approach  does  not  yield  estimates  of  multivariate  probabilities,  so  that  construction 
of  tables  such  as  5-4  and  5-6  comparing  observed  counts  to  marginal  predicted 
counts  of  sequences  to  evaluate  goodness  of  fit  is  impossible. 
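
To illustrate why an autoregressive working structure is computationally convenient, the following sketch checks the well-known fact that the inverse of a first-order autoregressive correlation matrix is tridiagonal and available in closed form. It is not Zeger's (1988) approximation itself; the function names and the values of `T` and `rho` are chosen only for illustration.

```python
import numpy as np

def ar1_corr(T, rho):
    """AR(1) correlation matrix with entries rho^|i - j|."""
    idx = np.arange(T)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def ar1_precision(T, rho):
    """Closed-form (tridiagonal) inverse of the AR(1) correlation matrix."""
    P = np.zeros((T, T))
    i = np.arange(T)
    P[i, i] = 1 + rho**2            # interior diagonal entries
    P[0, 0] = P[-1, -1] = 1.0       # the two boundary entries
    P[i[:-1], i[:-1] + 1] = -rho    # superdiagonal
    P[i[:-1] + 1, i[:-1]] = -rho    # subdiagonal
    return P / (1 - rho**2)

R = ar1_corr(6, 0.7)
P = ar1_precision(6, 0.7)
print(np.allclose(R @ P, np.eye(6)))   # the closed form is the exact inverse
```

No numerical inversion of a dense $T \times T$ matrix is needed in this case, which is what makes such a working structure feasible for long series.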

6.3     Extensions  and  Further  Research 

Several  extensions  of  the  proposed  methodology  are  possible,  opening  new  lines 
of research. In the following, we give a brief overview of some ideas.

6.3.1      Alternative Random Effects Distribution

We  focused  on  latent,  first-order  autoregressive  random  effects  processes  for 
describing  time  series  observations  in  a  GLMM  framework,  but  extensions  to 
pth-order  processes  are  possible.  The  derivations  of  the  MCEM  algorithm  and  the 
Gibbs sampler should be similar, where the conditional distribution of $u_t$ now will
depend on its 2p neighbors. More generally, other random effects distributions
(possibly improper) can be explored. For example, our AR(1) process is a special
case of a generalized autoregressive random effects process
$u_t = \rho \sum_{j=1}^{T} q_{tj} u_j + \epsilon_t$,
which was proposed by Ord (1975). With appropriately specified constants $q_{tj}$, one
alternative to AR(1) (or AR(2)) random effects is to let

$$
\begin{aligned}
u_1 &= \rho u_2 + \epsilon_1 \\
u_t &= \rho (u_{t-1} + u_{t+1}) + \epsilon_t, \quad t = 2, \ldots, T-1 \\
u_T &= \rho u_{T-1} + \epsilon_T.
\end{aligned}
$$

For this case, Sun, Speckman and Tsutakawa (2000) mention that the full univariate
conditional distribution of $u_t$ depends on $(u_{t-2}, u_{t-1}, u_{t+1}, u_{t+2})$ for
$3 \le t \le T-2$, and similar interpretations as given above regarding the random
effects as building blocks for the correlation in the time series apply. As
mentioned in Section 1.4.1, models with more complicated random effects structures
are often fit in a Bayesian framework, assuming noninformative priors on the
fixed effects and variance components. In that setting, propriety of the posterior
distribution cannot be guaranteed for a Poisson GLMM when one of the observed
counts is zero, and is impossible in a logit link GLMM for binomial observations
if they are equal to 0 or $n_t$ for just one $t$ (see Theorem 4.1 and Examples 4.1 and
4.2 in Sun, Speckman and Tsutakawa, 2000). Hence, with noninformative (flat)
priors on fixed effects and variance components of the correlated random effects
distribution, Bayesian GLMMs for binary time series result in improper posteriors.
Further developing the likelihood-based models and methods presented in this
dissertation would therefore be a worthwhile goal.
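
The neighborhood structure of the two-sided process mentioned above can be checked numerically. The sketch below assumes, for illustration only, that the innovations are independent normal with common variance, so that the precision matrix of the random effects is proportional to $(I - \rho Q)^\top (I - \rho Q)$, where $Q$ has entries 1 for $|t - j| = 1$ and 0 otherwise. For Gaussian random effects, the full conditional of $u_t$ involves exactly those $u_j$ with a nonzero entry in row $t$ of the precision matrix. The function names and the values of `T` and `rho` are hypothetical.

```python
import numpy as np

def neighbor_matrix(T):
    """Q with q_tj = 1 if |t - j| = 1 and 0 otherwise."""
    Q = np.zeros((T, T))
    i = np.arange(T - 1)
    Q[i, i + 1] = 1.0
    Q[i + 1, i] = 1.0
    return Q

def precision_pattern(T, rho):
    """Nonzero pattern of the precision of u when (I - rho Q) u = eps, eps iid normal."""
    B = np.eye(T) - rho * neighbor_matrix(T)   # invertible whenever |rho| < 0.5
    precision = B.T @ B                        # precision up to the factor 1 / Var(eps_t)
    return np.abs(precision) > 1e-12

pattern = precision_pattern(8, 0.3)
t = 4                                          # an interior time point (0-based index)
offsets = np.flatnonzero(pattern[t]) - t       # positions of nonzero entries relative to t
print(offsets)
```

Apart from the diagonal entry, the nonzero entries in an interior row sit at offsets -2, -1, 1 and 2, in line with the dependence on $(u_{t-2}, u_{t-1}, u_{t+1}, u_{t+2})$ noted by Sun, Speckman and Tsutakawa (2000).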

6.3.2     Topics in GLMM Research

Goodness  of  fit  measures  for  GLMMs  remain  an  active  area  of  research. 
We proposed some methods here based on a comparison of observed
and marginally fitted counts; however, no formal statistic was developed. Part
of the problem is that GLMMs cannot easily be made a special case of some
broader  model  (such  as  the  saturated  model  for  GLMs)  and  compared  to  it. 
Recently,  Presnell  and  Boos  (2004)  suggested  a  test  for  model  misspecification 
based  on  comparing  the  maximized  likelihood  of  a  model  to  one  motivated  by  a 
cross-validation  approach  where  observations  are  deleted  sequentially.  It  would 
be  interesting  to  see  how  this  applies  to  GLMMs,  although  the  computational 
complexity  due  to  refitting  the  model  several  times  might  be  a  huge  burden. 

Similar  computational  costs  arise  when  we  try  to  determine  if  GLMMs  with 
autoregressive  random  effects  are  useful  for  prediction.  Section  5.4.2  presented 
some  theory  of  predicting  future  outcomes  in  binary  time  series,  but  more  work  is 
needed  on  cross-validation,  misclassification  rate  or  similar  measures  of  the  quality 
of  the  model  for  prediction.  From  the  examples  we  have  seen  in  this  dissertation, 
it  appears  that  the  estimate  of  the  standard  deviation  of  the  latent  autoregressive 
process  is  rather  large,  which  would  lead  to  wide  prediction  intervals  on  the  logit 
and  original  probability  scale. 
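
A rough numerical impression of this effect is given by the following sketch. It is deliberately naive: it treats the linear predictor `eta` as known, ignores the information in past observations and all estimation uncertainty, and simply maps eta plus or minus 1.96 standard deviations of the latent process to the probability scale. The function name and the values of `eta` and `sigma` are hypothetical and not taken from the fitted models.

```python
import numpy as np

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-x))

def naive_interval(eta, sigma, z=1.96):
    """Interval eta +/- z * sigma on the logit scale and its image on the probability scale."""
    lo, hi = eta - z * sigma, eta + z * sigma
    return (lo, hi), (expit(lo), expit(hi))

logit_iv, prob_iv = naive_interval(eta=0.0, sigma=1.5)
print(logit_iv, prob_iv)
```

With a latent standard deviation of this size the interval covers most of the unit interval, so a prediction of the success probability alone is not very informative.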

A  lot  of  recent  work  has  focused  on  transitional  models  for  categorical  time 
series  with  more  than  two  categories  (e.g.,  Fokianos  and  Kedem,  2003).  We  believe 
that  a  multivariate  GLMM  approach  (Agresti,  2002;  Fahrmeir  and  Tutz,  2001)  with 
carefully  specified  correlated  random  effects  (univariate  or  multivariate)  might  be 
an  alternative  worth  studying. 

Lastly,  although  we  focused  on  analyzing  a  single  time  series,  it  should  be 
straightforward  to  extend  our  methodology  to  the  analysis  of  several,  independent 
time  series  (e.g.,  one  for  each  subject  in  a  longitudinal  study  with  a  large  number 
of repeated observations), as was alluded to throughout Chapters 2 and 3 of this
dissertation. Unlike GEE, our methodology, with its special regard to unequally spaced
time series, should be practical when not all subjects are measured at common time
points and, as is very realistic, subjects skip certain time points. Smith
and Diggle (1998) propose a GEE approach for estimating fixed effects parameters
in such circumstances, coupled with a complicated pseudo-likelihood approach that
assumes independence for estimating the variance components. We would like to
extend our proposed likelihood framework, jointly estimating all parameters, to this
situation.